PREFACE


Interest in the field of parallel processing continues to climb. This trend is evidenced by the sharp increase 
in papers submitted to the International Conference on Parallel Processing during recent years: 


        Papers      Papers
Year    Submitted   Accepted   Percent

1980    170         65         57
1983    240         136        57
1986    400         170        43
1987    487         174        36
1988    590         173        29


Although the number of submissions continues to increase, the number of accepted papers this year and 
in the past two years has remained relatively unchanged. This is due to the limitation imposed by the fixed 
number of hours available for the conference. As a result, a record number of papers had to be rejected. This 
year, the conference proceedings are being published in three volumes according to subject category. The
breakdown of submissions and acceptances in the three main categories of this conference is as follows: 


                              Papers      Papers
Category                      Submitted   Accepted   Percent

Architecture                  264         74         28
Software                      144         43         30
Algorithms and Applications   182         56         31


Of the 173 papers that were accepted, 79 were accepted as regular papers and 94 were accepted as short papers. 
Many papers that normally would have been accepted as long papers were accepted as short papers in order to 
meet the maximum number of paper-sessions allotted for the conference. 

Finding sufficient numbers of qualified reviewers was a particularly challenging task this year, due to the 
record number of submissions. Over 1,000 professionals in the field participated in this process. This year 
the process of selecting referees was simplified by the use of questionnaires, which were mailed to previous 
participants in the conference. The information on the completed questionnaires was entered into databases,
which then allowed the conference chairmen to select reviewers qualified in fairly specialized fields. Even so, 
numerous papers were so highly specialized that custom selection of referees was still required. It appears 
that an even more detailed breakdown of specializations will be needed for these questionnaires in the future. 
Greater effort will also be required in the future to find additional reviewers to adequately evaluate the increasing 
numbers of submissions.

I wish to thank the management of the Numerical Aerodynamic Simulation Systems Division at NASA 
Ames for providing me the opportunity to serve on the program committee this year. I also wish to thank 
the following persons on our staff who assisted in selecting referees and in handling the correspondence: Liviu 
Lustman, Martin Fouts, Julie Swisshelm, Horst Simon, Creon Levit, Gina Riley, Saundra Ramirez, and Reina 
Trinwith. I wish also to thank Prof. Tse-yun Feng for his support and encouragement in this effort. 


David H. Bailey 
NASA Ames Research Center 
Moffett Field, CA 94035 
Abstract -- This paper defines and describes a broadly applicable
approach to mapping of parallel computations upon
multiprocessors, and briefly sketches the related mapping 
algorithms. The approach begins with a graph representation 
of a parallel computation and first generates a reduced graph 
by merging nodes with high internode communication cost 
through iterative use of a critical path algorithm. This graph 
is then mapped to a graphical representation of a multiproces- 
sor architecture by the mapping algorithms. These algo- 
rithms attempt to minimize the total execution time including 
both computation and communication times. The algorithms, 
while they are heuristic rather than true optimal algorithms, 
are shown to yield excellent results in example applications 
and have modest execution costs. 


1. INTRODUCTION 


This paper defines and describes a broadly applicable 
approach to mapping of parallel computation structures (con- 
sisting of mutually dependent schedulable units of computa- 
tions) upon MIMD multiprocessor architectures, and then 
sketches the related heuristic mapping algorithms. It also 
gives examples of the results obtained by application of the 
algorithms to different types of parallel computation structures
and different multiprocessor architectures. The algorithms
are based upon the mapping of a graphical representation
of a parallel computation structure [4, 5] upon a graphical
representation of a multiprocessor architecture. In fact, we
consider a series of transformations and mappings between a
computation graph and an architecture graph as illustrated in
Fig. 1-1.

Figure 1-1 General Overview of Our Approach


The algorithms are described informally herein, but complete
formal definitions can be found in [17]. The algorithms
apply to a broad class of graphs which can be derived from
programs with various types of loop structures, and to a wide
class of architectures. The algorithms attempt to minimize the 
total execution time (computation time and communication 
time) of the parallel computation. Reduction of execution 
time is attained mainly by reduction of communication time 
by merging of schedulable units of computation. The first 
step of each of the algorithms is the reduction of the compu- 
tation graph to a virtual architecture graph through transfor- 
mations determined by iterative application of a critical path 
algorithm. This virtual architecture graph is then either 
transformed into another virtual architecture graph or mapped 
onto an abstracted graphical representation of a multiproces- 
sor architecture (called physical architecture graph). Note that 
a computation graph is assumed to have one root node and 
one leaf node without loss of generality. 


The algorithms are heuristics with modest execution 
cost. True optimal algorithms for the scheduling problem as 
stated in Section 2 are known to be NP-complete. Three 
applications are given: mapping of the Sieve of Eratosthenes 
to an Intel iPSC/5 [15], mapping of a Gaussian (forward) 
elimination to a Sequent Balance and mapping of a molecular 
physics code to an emulated Intel iPSC/5 configuration with a 
mixture of fast and slow processors and communication chan- 
nels. The results of the applications are surprisingly good. 
Near optimal total execution times are coupled with near
minimal resource requirements and good workload balancing. 


This paper is organized as follows: After giving the 
problem statement in Section 2, we briefly review previous
work in Section 3. Then, in Section 4, after discussing our 
approach and the models for computation and architecture 
graphs, we explain mapping algorithms based on linear clus- 
ters. Section 5 gives a brief summary of performance results 
and Section 6 summarizes the status of the research. 


2. PROBLEM STATEMENT 


A parallel computation can be represented by a directed
acyclic graph $G_C = (N_C, E_C)$, where $N_C = \{n_1, n_2, \ldots, n_l\}$
is a set of schedulable units of computation to be executed,
and $E_C$ specifies scheduling constraints and data dependencies
defined on $N_C$. A multiprocessor architecture can be
represented by an undirected graph $G_P = (N_P, E_P)$, where
$N_P = \{p_1, p_2, \ldots, p_m\}$ is a set of processors, and $E_P$
specifies interconnection topology among the processors.
The basic problem is to find a mapping of $G_C$ onto $G_P$ which
minimizes schedule length (or makespan) defined as:

$$\max_{\sigma \in \Sigma} \sum_{n_i \in \sigma} (comp_i + comm_{ij}),$$

where $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_k\}$ represents a set of paths from the
root node to the leaf node in $G_C$, node $n_j$ (assigned to processor
$p_y \in N_P$ ($1 \le y \le m$)) is a direct descendant of node $n_i$
(assigned to processor $p_x \in N_P$ ($1 \le x \le m$)) in $G_C$, $comp_i$ is
computation time of $n_i$, and $comm_{ij}$ is communication time
from $n_i$ to $n_j$ ($comm_{ij} = 0$, if $p_x = p_y$ or $n_i$ has no direct
descendants).
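The schedule length defined above can be evaluated for a given assignment with a short recursive sketch. The function name `schedule_length` and the dictionary layout (`succ`, `comp`, `comm`, `proc`) are illustrative assumptions, not notation from the paper. The sketch follows the path formula literally, so it does not model serialization of independent tasks that share a processor.

```python
from functools import lru_cache

def schedule_length(succ, comp, comm, proc, root):
    """Longest root-to-leaf path cost: sum of comp_i + comm_ij along the path.

    succ: {node: tuple of direct descendants}   (hypothetical layout)
    comp: {node: computation time}
    comm: {(n_i, n_j): communication time}
    proc: {node: processor the node is assigned to}
    """
    @lru_cache(maxsize=None)
    def longest_from(n):
        tails = []
        for j in succ.get(n, ()):
            # comm_ij is zero when both tasks sit on the same processor
            c = 0 if proc[n] == proc[j] else comm[(n, j)]
            tails.append(c + longest_from(j))
        # a leaf has no descendants, so it contributes only its own comp time
        return comp[n] + max(tails, default=0)

    return longest_from(root)

# Diamond-shaped computation graph: 1 -> {2, 3} -> 4
succ = {1: (2, 3), 2: (4,), 3: (4,)}
comp = {1: 1, 2: 5, 3: 2, 4: 1}
comm = {(1, 2): 3, (1, 3): 3, (2, 4): 3, (3, 4): 3}
split = {1: "p0", 2: "p0", 3: "p1", 4: "p0"}
print(schedule_length(succ, comp, comm, split, root=1))
```

With task 3 placed on a second processor, the path through task 3 pays for both of its communication edges, which is the trade-off the mapping algorithms try to balance.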


An optimal schedule is one which meets the criteria of
the minimum schedule length for a single parallel computation
structure or the maximum total throughput for a set of
simultaneously executing parallel computation structures. It
must integrate scheduling of computations and dependency
relations to resources. An approach which integrates consideration
of all the interacting factors is one which maps a
computation graph defining the computation structure
(including the resource requirements for execution of each
element of the computation structure) onto an architecture
graph which defines the capability and capacity of the
resource set of the execution environment. From here on we
use the terms task and schedulable unit of computation,
each of which corresponds to a node in a computation graph,
interchangeably.


3. PREVIOUS WORK 


The problem of optimal scheduling (as defined in Section 2)
of parallel computations upon multiprocessor architectures
has received considerable attention in the literature.
Finding true optimal solutions, even in the absence of
resource constraints, is well known to be NP-complete [12,
21]. In fact, it is proven by Kim [17] that the other interesting
scheduling problems are also NP-complete or worse in
computation complexity.


There have been many heuristic algorithms proposed in 
the past. Previous approaches have focused mainly on the 
development of specific mapping strategies for particular 
multiprocessor architectures. Some attempt to take advantage 
of the unique hardware characteristics such as interconnec- 
tion topologies of multiprocessor architectures under con- 
sideration. Since each strategy is usually an ad-hoc scheme, it 
is in most cases applicable to some limited class of multipro- 
cessor architectures (e.g., tightly-coupled homogeneous 
architectures [2], loosely-coupled homogeneous architectures 
[23], loosely-coupled heterogeneous architectures [11], or 
multicomputers connected in point-to-point fashion [6]). 

Various simplifying assumptions are common. For 
example, Bokhari [3] studies the assignment of tasks to pro- 
cessors with the restriction that the number of tasks should be 
less than or equal to the number of processors. Shen and Tsai 
[20] propose a graph matching approach for solving task 
assignment to processors, but ignore dependency relations 
among tasks. Some approaches have limited scheduling
objectives; they find the best schedule with respect to either
the total computation time [13] or interprocessor communication
time [14]. Other approaches are interested in balancing
the workload of the total multiprocessor architecture [10, 22].


In most scheduling strategies for tightly-coupled architectures,
specific interconnection networks such as the
Butterfly switch, the Omega network, the SW-Banyan network
or a composition of them [19] are assumed. On the
other hand, most research has not taken into account scheduling
constraints, resource limitations, and/or the current workloads
of processors. It is frequently assumed that each processor
is identical (i.e., all have the same processing speed,
equal number of communication channels, and identical
memory capacity). Finally, while most scheduling strategies
make heavy use of busy-waiting as a synchronization
mechanism, there is little attempt to reduce or avoid using it.


All in all, there are a myriad of multiprocessor schedul- 
ing strategies which can be applied to specific multiprocessor 
architectures. On the other hand, there is little research 
which attempts an integrated approach to multiprocessor 
scheduling which could be applicable to various multiproces- 
sor architectures regardless of underlying architectural 
characteristics. 


4. APPROACH AND ALGORITHMS 


4.1. Approach 


One of the contributions of this paper is to propose algo- 
rithms based on linear clustering. A linear cluster is a con- 
nected subgraph of a computation graph which is in the form 
of a linear list of schedulable units of computation. Linear 
clustering is an effectual heuristic to compromise between 
two conflicting goals of multiprocessor scheduling, minimi- 
zation of interprocessor communication and maximization of 
potential parallelism, and to satisfy the other goals, 
throughput enhancement and workload balance, relatively 
well. The underlying idea of linear clustering is that the 
schedulable units of computation that are sequentially depen- 
dent on each other are to be assigned to one processor, while 
those that are mutually independent are to be allocated to 
separate processors. We select linear clusters on the basis of 
total execution time on an architecture with a processor for 
each node of the graph and a distinct communication channel 
per each edge of the graph. The critical restriction of linear 
clustering is that it expects a computation graph to be acyclic. 
To minimize this restriction, we identify cases in which 
cyclic computation graphs can be transformed into acyclic 
graphs in a straightforward manner [17]. 


A computation graph is transformed into a virtual architecture
graph (VAG) by linear clustering. The VAG in fact
represents an optimal multiprocessor architecture for the 
computation graph. The optimal architecture provides one 
processor to every linear cluster so that mutually independent 
tasks belonging to different linear clusters can be executed in 
parallel as long as possible. Furthermore, direct communica- 
tion channels are always available for any adjacent linear 
clusters in the optimal architecture. 


The VAG may be transformed into another VAG by
merging two or more linear clusters into one cluster. Two
linear clusters $K_1$ and $K_2$ are combined into one if $K_2$ may
start only after $K_1$ finishes or may be executed only while $K_1$
is idle. Merging contributes to further balancing the workload of
processors, and further reducing the amount of resources to
be utilized and interprocessor communication overhead.


After constructing a VAG which represents the optimal 
multiprocessor architecture for a given computation graph, 
we then find an optimal mapping of the VAG onto a physical 
architecture graph (PAG) which represents the target archi- 
tecture. This mapping is called a physical mapping as it is 
the final mapping of a computation graph onto a real physical 
multiprocessor architecture. We develop homogeneous and 
heterogeneous mapping algorithms for homogeneous and 
heterogeneous architectures, respectively. 


These algorithms rely on not only local information but 
also on limited global information. The key issue is how to 
reduce the mapping complexity while sacrificing as little 
optimality as possible. A dominant request tree is a maximal 
spanning tree of a VAG. It provides limited global informa- 
tion on the VAG such as the mapping order of the nodes and 
the edges whose adjacency should be maintained. Both map- 
ping algorithms utilize dominant request trees, but take quite 
different approaches to mapping the trees onto PAG’s. Most 
importantly, in the case of homogeneous mappings, the trees 
are directly mapped onto PAG’s. On the other hand, in the 
case of heterogeneous mappings, they are mapped onto dom- 
inant service trees. A dominant service tree is a maximal 
spanning tree of a PAG. For heterogeneous mappings, one of 
the important issues is how to identify and utilize resources 
with high performance. A dominant service tree provides 
such information. 
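Both dominant trees are described as maximal spanning trees. A minimal sketch of a maximum-weight spanning tree, here using Kruskal's algorithm with a union-find structure, is given below. The choice of Kruskal's algorithm, the function name `maximum_spanning_tree`, and the use of channel bandwidth as the edge weight are assumptions made for illustration; the paper constructs its trees from a dominant node (Section 4.4.1).

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Return the edges of a maximum-weight spanning tree.

    nodes: iterable of node names
    weighted_edges: iterable of (weight, u, v) for an undirected graph
    """
    parent = {n: n for n in nodes}

    def find(x):
        # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Visit the heaviest edges first; keep an edge if it joins two components.
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical 4-processor PAG; weights stand for channel bandwidths.
pag_edges = [(10, "p0", "p1"), (1, "p1", "p2"), (5, "p0", "p2"), (7, "p2", "p3")]
dst = maximum_spanning_tree(["p0", "p1", "p2", "p3"], pag_edges)
print(dst)
```

The slow channel between p1 and p2 is left out of the tree, which is the kind of limited global information a dominant service tree is meant to expose.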


4.2. Model of Computation 

Browne [5] proposes a directed graph as a representation
basis of a parallel computation, in which the nodes represent
the bindings of operations to data and the edges represent
dependency relations between schedulable units of computation
executed at the nodes. Our computation graph model is a
triple $(G_c, f_c^{comp}, f_c^{comm})$, whose first component
$G_c = (N_c, E_c)$ specifies a parallel computation. Computation
graph $G_c$ is a directed acyclic graph and defined as follows:


(i) A node set $N_c = \{n_1, n_2, \ldots, n_l\}$;


(ii) An edge set $E_c = \{e_1, e_2, \ldots, e_j\}$, where any given
edge $e_k = (n_i, n_j)$ is directed from node $n_i$ to node $n_j$.


To be specific, graph $G_c$ defines computation steps by
the nodes and sequencing among the steps by the edges. The
remaining components provide information necessary for
mapping the computation graph onto a target architecture.
The second component $f_c^{comp}$ is a function which maps each
node in $N_c$ onto a positive integer which is the expected
computation time used by the schedulable unit of computation
corresponding to the node. The next function $f_c^{comm}$ maps
each edge $(n_i, n_j)$ in $E_c$ onto a nonnegative integer which is
the expected amount of internode communication from node
$n_i$ to node $n_j$. For example, if $f_c^{comm}(e_k) = N_{bytes}$ for $e_k =
(n_i, n_j)$, then the total length of messages sent from $n_i$ to $n_j$ is
$N_{bytes}$ bytes.
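The computation graph triple can be held in a small data structure. The sketch below is one possible encoding, not the paper's implementation: the class name `ComputationGraph`, its field names, and the `validate` method are hypothetical. The node and edge sets are implied by the keys of the two weight functions.

```python
from dataclasses import dataclass

@dataclass
class ComputationGraph:
    comp: dict  # f_comp: node -> expected computation time (positive integer)
    comm: dict  # f_comm: (n_i, n_j) -> expected communication volume (>= 0)

    def validate(self):
        """Check the stated constraints: weights in range and no cycles."""
        if any(t <= 0 for t in self.comp.values()):
            raise ValueError("computation times must be positive")
        for (u, v), volume in self.comm.items():
            if u not in self.comp or v not in self.comp:
                raise ValueError(f"edge {(u, v)} references an unknown node")
            if volume < 0:
                raise ValueError("communication volumes must be nonnegative")
        # Acyclicity: repeatedly remove nodes that have no remaining ancestors.
        indegree = {n: 0 for n in self.comp}
        for _, v in self.comm:
            indegree[v] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            u = ready.pop()
            removed += 1
            for (a, b) in self.comm:
                if a == u:
                    indegree[b] -= 1
                    if indegree[b] == 0:
                        ready.append(b)
        if removed != len(self.comp):
            raise ValueError("computation graph contains a cycle")
        return self

# Two tasks; 'a' sends 128 bytes to 'b'.
g = ComputationGraph(comp={"a": 3, "b": 2}, comm={("a", "b"): 128}).validate()
print(g.comm[("a", "b")])
```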

Our computation graph is a restricted model in a couple
of ways. The critical restriction that makes the model
inappropriate for representing some parallel computations is that
the edges entering or leaving a given node may not be
joined by OR conditions. The other restriction is that computation
graphs are required to be static; neither new nodes nor
new edges can be created during runtime. The main reason 
for these restrictions is to avoid ambiguity in determining the 
computation and communication requirements of the nodes 
and edges in a computation graph. 


The model for architecture graphs provides a representation
basis for the structural description of multiprocessor
architectures. We consider three types of resources: processors,
communication channels and memory. Our architecture
graph model is also a triple $(G_a, f_a^{comp}, f_a^{comm})$, whose first
component $G_a = (N_a, E_a)$ is an undirected graph defined as
follows:

(i) An architecture node set $N_a = \{an_1, an_2, \ldots, an_i\}$;

(ii) An architecture edge set $E_a = \{ae_1, ae_2, \ldots, ae_j\}$,
where any architecture edge $ae_k = (an_i, an_j)$ is
undirected.


In an architecture graph, an architecture node represents
a processor as well as a memory module, and an architecture
edge represents a communication channel between two processors.
The second component $f_a^{comp}$ is a function which
maps each architecture node in $N_a$ onto a pair of positive
integers which denote the level of computing power of a processor
relative to the others in the architecture and the current
local memory size. A common global memory may be
specified by a dummy architecture node which is fully-connected
with the other architecture nodes. The next function
$f_a^{comm}$ maps an architecture edge $(an_i, an_j)$ in $E_a$ onto a
positive integer which represents the bandwidth of the communication
channel from $an_i$ to $an_j$ and vice versa.
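The architecture graph triple admits a similar encoding. In this sketch the class name `ArchitectureGraph`, its fields, and the helper `is_homogeneous` are hypothetical. The homogeneity test is a simplification that compares only computing power and channel bandwidth.

```python
from dataclasses import dataclass

@dataclass
class ArchitectureGraph:
    node_info: dict  # f_comp: an_i -> (relative computing power, local memory size)
    bandwidth: dict  # f_comm: frozenset({an_i, an_j}) -> channel bandwidth

    def neighbors(self, n):
        """Architecture nodes that share a direct channel with n."""
        return sorted(m for pair in self.bandwidth if n in pair
                      for m in pair if m != n)

    def is_homogeneous(self):
        # Simplified test (an assumption): all processors have the same
        # relative power and all channels have the same bandwidth.
        powers = {power for power, _ in self.node_info.values()}
        widths = set(self.bandwidth.values())
        return len(powers) <= 1 and len(widths) <= 1

# Three processors in a line; p2 is twice as fast as the others.
pag = ArchitectureGraph(
    node_info={"p0": (1, 512), "p1": (1, 512), "p2": (2, 512)},
    bandwidth={frozenset(("p0", "p1")): 10, frozenset(("p1", "p2")): 10},
)
print(pag.neighbors("p1"), pag.is_homogeneous())
```

Storing each channel under a `frozenset` key reflects that architecture edges are undirected.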


It is assumed that an architecture graph is static; the
resource configuration of a physical multiprocessor architecture
will not be changed dynamically during runtime. Moreover,
it maintains the exact current status of the architecture.
The status includes the information on which processors are
currently active/inactive, which communication channels are
currently available and what is the current memory capacity
available in each processor.


4.3. Mapping Based on Linear Clusters


Clustering techniques have been used in a variety of
areas in computer science [1, 7]. In this section, we propose
a new mapping technique based on linear clustering and
linear cluster merging. After discussing linear clustering and
merging, we explain how to iteratively refine linear clusters
(if necessary) for the minimization of schedule length.


4.3.1. Linear Clustering 


Linear clustering is a fundamental idea of our mapping 
algorithms discussed in Section 4.4. A cluster of Gc = (Nc, 
Ec) is called a linear cluster K if it satisfies the following 
conditions: 


• K is nonempty;
• K is a connected subgraph of Gc;
• Both indegree and outdegree of every node in K are less
  than or equal to 1.


Linear clustering is a special case of general clustering in that 
a linear cluster is a degenerate tree in which each node has at 
most one direct ancestor and/or one direct descendant, while 
a cluster, in general, is an arbitrary graph. 


The following algorithm LinearCluster illustrates how 
to identify linear clusters: 


LinearCluster (G, K)
/* G is a (cycle-free) computation graph. */
/* K is a set of linear clusters. */
Begin
    Let K = ∅;
    Find a longest path P from the root to a leaf node in G;
    While traversing path P backward
        from the leaf to the root node,
        cut all the incoming and outgoing edges
        except the ones belonging to P;
    For each connected subgraph S of G,
        If both indegree and outdegree of each node in S
            are less than or equal to 1,
        Then
            K = K ∪ {S}
        Else Do
            LinearCluster (S, K');
            K = K ∪ K';
        End Do;
End LinearCluster.
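The recursion in LinearCluster can be sketched in Python as follows. This is an illustrative reading of the pseudocode under stated assumptions. Path length is taken as the plain sum of computation and communication times along the path, not the paper's weighted cost function. A subgraph with several sources or sinks uses the longest source-to-sink path. The helper names `longest_path` and `components` are hypothetical.

```python
from collections import defaultdict

def longest_path(nodes, edges, comp, comm):
    """Longest source-to-sink path of a DAG (assumed weight: comp + comm)."""
    succ, indeg = defaultdict(list), {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    order, ready = [], [n for n in sorted(nodes) if indeg[n] == 0]
    while ready:                        # topological order (Kahn's algorithm)
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    best = {n: comp[n] for n in nodes}  # best path length ending at n
    prev = {n: None for n in nodes}
    for u in order:
        for v in succ[u]:
            cand = best[u] + comm[(u, v)] + comp[v]
            if cand > best[v]:
                best[v], prev[v] = cand, u
    end, path = max(order, key=lambda n: best[n]), []
    while end is not None:              # walk back from the best sink
        path.append(end)
        end = prev[end]
    return path[::-1]

def components(nodes, edges):
    """Connected components, ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in sorted(nodes):
        if n in seen:
            continue
        stack, part = [n], set()
        while stack:
            x = stack.pop()
            if x not in part:
                part.add(x)
                stack.extend(adj[x] - part)
        seen |= part
        comps.append(part)
    return comps

def linear_cluster(nodes, edges, comp, comm):
    """Return linear clusters as lists of nodes in execution order."""
    path = longest_path(nodes, edges, comp, comm)
    on_path, path_edges = set(path), set(zip(path, path[1:]))
    # Cut every edge touching the path, except the path's own edges.
    kept = [e for e in edges if e in path_edges
            or (e[0] not in on_path and e[1] not in on_path)]
    clusters = []
    for part in components(nodes, kept):
        sub = [e for e in kept if e[0] in part]
        outs, ins = [u for u, _ in sub], [v for _, v in sub]
        if len(set(outs)) == len(outs) and len(set(ins)) == len(ins):
            nxt = dict(sub)             # in/out degree <= 1: already a chain
            chain = [(part - set(ins)).pop()]
            while chain[-1] in nxt:
                chain.append(nxt[chain[-1]])
            clusters.append(chain)
        else:                           # not linear yet: recurse on the part
            clusters.extend(linear_cluster(part, sub, comp, comm))
    return clusters

diamond = [(1, 2), (1, 3), (2, 4), (3, 4)]
times = {1: 1, 2: 5, 3: 2, 4: 1}
print(linear_cluster({1, 2, 3, 4}, diamond, times, {e: 1 for e in diamond}))
```

For the diamond graph, the critical path 1, 2, 4 becomes one cluster and the independent task 3 becomes its own cluster, so tasks 2 and 3 can still run in parallel.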


A path $(n_1, n_2, \ldots, n_l)$ of graph $G_C = (N_C, E_C)$ such that
$n_i \in N_C$ and $(n_i, n_{i+1}) \in E_C$ is considered the longest path
if it maximizes the following function:

$$\sum_{i=1}^{l-1} \Big[ \omega_1 \cdot T_{comp_i} + (1-\omega_1) \cdot \Big( \omega_2 \cdot T_{comm_{i,i+1}} + (1-\omega_2) \cdot \sum_{j \in Adj_i,\; j \neq i+1} T_{comm_{ij}} \Big) \Big] + \omega_1 \cdot T_{comp_l}$$

where $T_{comp_k}$ is the computation time of node $n_k$ ($1 \le k \le l$),
$T_{comm_{st}}$ is the communication time of node $n_s$ with an adjacent
node $n_t$ ($1 \le s \le l$ and $1 \le t \le l$), $Adj_t$ denotes a set of nodes
adjacent to $n_t$ ($1 \le t \le l$), and both $\omega_1$ and $\omega_2$ are normalization
factors.


4.3.2. Linear Cluster Merging 


In this section, we investigate a means to merge two or
more linear clusters into one without affecting potential
parallelism existing in a computation graph. Merging may
contribute to further balancing the workload of processors. It may
also contribute to further reducing the amount of resources to be
utilized and interprocessor communication overhead.


The level numbers may be used to identify potential 
parallelism [18] in a computation graph if defined as follows: 


level(T) = 1, if T is a root node;
         = [max(level(A) for each direct ancestor A of T)] + 1, otherwise.


Then, the same level number implies mutual independence.
To be more specific, if a group of tasks have the same level
number, they are mutually independent and may be
simultaneously executable.
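The level-number definition translates directly into a memoized recursion. In this sketch the function name `level_numbers` and the `pred` dictionary (task to direct ancestors) are illustrative assumptions.

```python
def level_numbers(pred):
    """Compute level(T) for every task.

    pred: {task: iterable of direct ancestors}; root tasks map to ().
    """
    levels = {}

    def level(t):
        if t not in levels:
            # A root has no ancestors, so max(...) falls back to 0 and level = 1.
            levels[t] = 1 + max((level(a) for a in pred.get(t, ())), default=0)
        return levels[t]

    for task in pred:
        level(task)
    return levels

# Diamond: task 1 feeds tasks 2 and 3, which both feed task 4.
print(level_numbers({1: (), 2: (1,), 3: (1,), 4: (2, 3)}))
```

Tasks 2 and 3 receive the same level number, matching the observation that equal levels indicate tasks that may execute simultaneously.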


In order to define conditions for merging linear clusters,
let $L_i$ represent a set of level numbers assigned to tasks in
linear cluster $K_i$. Two linear clusters $K_i$ and $K_j$ are said to be
sequentially strong-dependent if they satisfy the following
conditions:


2) The trailer node of linear cluster $K_i$ precedes the header
node of linear cluster $K_j$.


Two linear clusters $K_i$ and $K_j$ are said to be mutually
strong-dependent if they satisfy the following conditions:

2) For two tasks $T_1$ and $T_2$ in $K_i$, $T_1$ is a direct ancestor of
$T_2$, where the former is one of the direct ancestors of the
header node of $K_j$ and has the largest level number
among the direct ancestors, and the latter is one of the direct
descendants of the trailer node of $K_j$ and has the smallest
level number among the direct descendants.


If a pair of linear clusters satisfies any of the merging conditions,
they can be merged into one cluster without affecting
the execution time of the computation.


4.3.3. Iterative Refinement of Linear Cluster 


In the previous sections, we discussed how to transform
a computation graph Gc into a virtual architecture graph by
linear clustering and merging. It is expected that a linear
cluster consisting of schedulable units of computation on the
critical path of Gc takes the longest time to finish in the VAG
in most cases. In this case, we can make use of the VAG for
the mapping onto a physical architecture graph without any
modification. This may not be true if the computation graph
has extremely heavy communication requirements on edges
not on the initial critical path. If that is the case, we may
need to iteratively refine linear clusters in the VAG so that we
can further reduce the total length of schedule prior to mapping.
Refinement consists of two steps:


• Linear cluster labeling;
• Linear cluster refinement.


During the labeling step, we label edges in a computation
graph Gc = (Nc, Ec). The level number $level_{edge}$ of edge $e_{ij}
= (n_i, n_j)$ may be defined as follows:

$$level_{edge}(e_{ij}) = \omega \cdot comp_i + (1-\omega) \cdot comm_{ij} + level_{node}(n_j),$$

where $level_{node}(n_j)$ is the level number of node $n_j$, $comp_i$
and $comm_{ij}$ are computation time of $n_i$ and communication
time from $n_i$ to $n_j$, respectively, and $\omega$ is a normalization
factor. Note that $level_{node}(n_j)$ is defined as

$$level_{node}(n_j) = \max_{k \in D_j} \big( level_{edge}(e_{jk}) \big),$$

where $D_j$ is a set of direct descendants of node $n_j$. These
edge labels allow us to identify the longest path to be
considered for the minimization of the total schedule length in a
VAG.
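The two mutually recursive level definitions can be computed bottom-up with memoization. In this sketch the function name `edge_levels`, the default $\omega = 0.5$, and the convention that a leaf node has node level 0 are assumptions; the text does not state the base case.

```python
from functools import lru_cache

def edge_levels(succ, comp, comm, omega=0.5):
    """Label every edge (n_i, n_j) with its level number.

    succ: {node: tuple of direct descendants}
    comp: {node: computation time}
    comm: {(n_i, n_j): communication time}
    """
    @lru_cache(maxsize=None)
    def node_level(n):
        # Assumption: a node without descendants has level 0.
        return max((edge_level(n, k) for k in succ.get(n, ())), default=0.0)

    def edge_level(i, j):
        return omega * comp[i] + (1 - omega) * comm[(i, j)] + node_level(j)

    return {(i, j): edge_level(i, j) for i in succ for j in succ[i]}

# Chain 1 -> 2 -> 3 with growing computation and communication times.
labels = edge_levels(
    succ={1: (2,), 2: (3,)},
    comp={1: 2, 2: 4, 3: 6},
    comm={(1, 2): 2, (2, 3): 4},
)
print(labels)
```

An edge label therefore accumulates the weighted cost of the heaviest path from that edge down to a leaf, which is what makes the labels usable for locating a new longest path.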


After linear cluster labeling, we can determine if there
are paths through a VAG, each of whose length is longer than 
the total computation time of a linear cluster corresponding to 
the critical path of the original computation graph Gc. If 
there exist such paths, we modify the current set of linear 
clusters in order to further reduce the total schedule length 
through iterative refinements of them. 


Figure 4-2 Possible Refinements of Linear Clusters: panels (a), (b), and (c)


In Fig. 4-1, let us assume that a new longest path is passing
through nodes $n_i$ and $n_j$, i.e., the new longest path is
$(\ldots, n_i, n_j, \ldots)$. The basic idea of linear cluster refinement
is to locate a cut edge $(n_i, n_j)$ on the longest path and to
reduce the length by merging nodes $n_i$ and $n_j$ (belonging to
separate linear clusters) into one. After the two nodes $n_i$ and
$n_j$ are merged, linear clusters shown in Fig. 4-1 can be refined
as shown in Fig. 4-2. In Fig. 4-2-a, we merge $n_i$ and $n_j$ into
one cluster, and cut the edges like $(n_i, n_k)$ and $(n_l, n_j)$ so that
all the clusters remain as linear clusters. In Fig. 4-2-b and
Fig. 4-2-c, however, we merge them, but leave one of the
edges uncut while we cut the other edge. This type of
refinement may force us to sacrifice some potential parallelism
since two or more nodes (e.g., $n_k$ and $n_j$ in Fig. 4-2-b, $n_i$
and $n_l$ in Fig. 4-2-c) executable in parallel are to be assigned
to the same cluster. Nonetheless, it is worthwhile to merge
two linear clusters in this way if internode communication
overhead from $n_i$ to $n_j$ is larger than the schedule extension
caused by sequential execution of nodes (e.g., $n_k$ and $n_j$ in
Fig. 4-2-b, $n_i$ and $n_l$ in Fig. 4-2-c).


4.4. Mapping Algorithms 


The subject of this section is how to map a VAG onto a 
PAG. The important goal of our proposed algorithms is to 
compromise between two extreme approaches [8, 18] by 
reducing the complexity of the mapping algorithms while 
sacrificing their optimality as little as possible. For physical 
mapping, we need to take into consideration as much global 
information as possible during mapping. 


4.4.1. Dominant Request Tree 


The basic idea of our algorithm is to find a subgraph
isomorphism [12] from a VAG to a PAG which minimizes the
total schedule length and satisfies given scheduling constraints.
We can easily show that the subgraph isomorphism
problem is NP-complete, making use of the fact that the
Undirected Hamiltonian Circuit problem is NP-complete. This
fact forces us to rely on heuristics. We map each node of a
VAG one by one in a sequential order. The key issue is then
how to determine the mapping order which leads to the
minimization of the schedule length. For this purpose, we
propose another transformation of a VAG into a tree called
Dominant Request Tree (DRT). This transformation can be
done independently of the target architecture (i.e., whether it
is homogeneous or heterogeneous).


A DRT is a maximal spanning tree of a VAG. We construct
the DRT starting from a node called the Most Dominant
Node (MDN) rather than starting from an arbitrary
node in the VAG. The MDN is that node n which maximizes
the cost function defined as:

$$\omega \cdot T_{comp} + (1-\omega) \cdot T_{comm},$$

where $T_{comp}$ is the computation time of n, $T_{comm}$ is the total
communication time of n with its adjacent node(s), and $\omega$ is a
normalization factor. The MDN is considered to be the most
important node in the VAG in the sense that it represents a
linear cluster which includes all tasks on the critical path in a
given computation graph. Since it is usually the case that the
MDN requires the largest weighted sum of computation and
communication times among nodes in the VAG, we would
like to assign the MDN to the most appropriate processor in a
PAG.

Starting from the MDN of a DRT, we select a node with
the highest binding power among unassigned nodes incident
upon any already assigned node until all the nodes in the
DRT are selected. The binding power of node $n_i$ with
respect to an adjacent node $n_j$ is determined by:

$$\omega_1 \cdot T_{comp_i} + (1-\omega_1) \cdot \Big( \omega_2 \cdot T_{comm_{ij}} + (1-\omega_2) \cdot \sum_{k \in A_i,\; k \neq j} T_{comm_{ik}} \Big),$$

where $T_{comp_i}$ is the computation time of $n_i$, $T_{comm_{ij}}$ is the
communication time of node $n_i$ with node $n_j$, and $A_i$
represents a set of nodes adjacent to $n_i$. $\omega_1$ and $\omega_2$ are again
normalization factors. A DRT of a VAG has two types of
edges: the primary and secondary edges. The former are
edges belonging to the DRT, while the latter are edges
belonging to the VAG but not to the DRT. Note that the
order in which each cluster is included in the DRT determines
the priority list L.
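The construction of a DRT and its priority list can be sketched as a greedy tree growth from the MDN. This sketch rests on assumptions: the function name `dominant_request_tree` and its data layout are hypothetical, and the binding power uses the sum of a node's communication times with its other neighbors as the last term. The weights `omega`, `w1`, and `w2` are arbitrary illustrative values.

```python
def dominant_request_tree(t_comp, t_comm, omega=0.5, w1=0.5, w2=0.5):
    """Grow a DRT from the Most Dominant Node of a VAG.

    t_comp: {cluster: computation time}
    t_comm: {(u, v): communication time}, one entry per undirected VAG edge
    Returns (mdn, priority_list, primary_edges).
    """
    adj = {n: {} for n in t_comp}
    for (u, v), t in t_comm.items():
        adj[u][v] = t
        adj[v][u] = t

    def cost(n):
        # MDN cost: weighted computation time plus total communication time.
        return omega * t_comp[n] + (1 - omega) * sum(adj[n].values())

    def binding(i, j):
        # Binding power of n_i with respect to its adjacent node n_j.
        others = sum(t for k, t in adj[i].items() if k != j)
        return w1 * t_comp[i] + (1 - w1) * (w2 * adj[i][j] + (1 - w2) * others)

    mdn = max(sorted(t_comp), key=cost)
    order, primary = [mdn], []
    while len(order) < len(t_comp):
        # Candidates: unselected nodes adjacent to an already selected node.
        cands = [(binding(i, j), i, j)
                 for j in order for i in adj[j] if i not in order]
        if not cands:
            break  # remaining nodes are not connected to the tree
        _, i, j = max(cands)
        order.append(i)
        primary.append((j, i))
    return mdn, order, primary

vag_comp = {"A": 10, "B": 4, "C": 6}
vag_comm = {("A", "B"): 4, ("A", "C"): 8, ("B", "C"): 1}
print(dominant_request_tree(vag_comp, vag_comm, w2=0.8))
```

Edges of the VAG that do not appear in the returned primary edge list, here the edge between B and C, correspond to the secondary edges.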


4.4.2. Homogeneous Mapping 


The goal of homogeneous mapping is to find a subgraph
in a PAG to which a DRT of a VAG is isomorphic, relying
on various heuristics like connectivity, exclusion, perturbation,
foster mapping, and restricted pairwise exchange. The
basic approach of the homogeneous mapping is to try to
maintain adjacency of each node in the DRT with its neighbors
as far as possible; whenever there is a direct primary
edge from cluster $K_1$ to cluster $K_2$, we choose processor $P_{K_2}$,
which has a direct channel from $P_{K_1}$. Note that $P_K$ and $K_{da}$
denote a processor onto which cluster $K$ is to be mapped and
the direct ancestor of $K$, respectively.


Each node of the VAG is assigned to a processor in the
order determined during transforming the VAG into the DRT.
For each cluster $K$ in the order of the priority list L, if there
are at least two clusters which form full-connectivity with $K$,
we first apply connectivity mapping. This heuristic attempts
to maintain full-connectivity among clusters during mapping.
If it is not successful to maintain the connectivity or there
exist no clusters which form full-connectivity with $K$, then
we try to assign $K$ to a free processor in PAG adjacent to
$P_{K_{da}}$. During this mapping, we apply exclusion mapping to
exclude processors in PAG which might be crucial to other
clusters yet to be assigned.


Next, we consider the case that $P_{K_{da}}$ has no more free
adjacent processors. Then, $K$ may be mapped onto a processor
which is not adjacent to $P_{K_{da}}$. For this case, we provide
two heuristics: perturbation and foster mappings. In both
heuristics, we first choose a processor which has the most
appropriate number of channels among currently unassigned
processors. If there is more than one, we choose the one
which is the nearest to $P_{K_{da}}$. Those unassigned processors
should be adjacent to at least one processor to which a cluster
has already been assigned.


In perturbation mapping, we attempt to preempt a linear
cluster which has already been assigned to a processor adjacent
to $P_{K_{da}}$, and to assign $K$ to the processor. There are two
possible cases that a linear cluster may be preempted after
being assigned to a processor. First, an adjacent processor
(say, $P_{K_{adj}}$) of $P_{K_{da}}$ might be assigned to cluster $K_{adj}$ which is
not in fact adjacent to cluster $K_{da}$ in the VAG. The other
possible case is that all the clusters assigned to adjacent processors
of $P_{K_{da}}$ are in fact neighbors of $K_{da}$, but $K_{adj}$ might
have less communication overhead with $K_{da}$ than $K$ in the
VAG.


As long as perturbation mapping does not make any
improvement, it is not possible to maintain adjacency using a
primary edge for this particular mapping. That is, cluster $K$
cannot communicate directly with cluster $K_{da}$. In order to
lessen the effect of the indirect communication, we first
check whether there is another cluster adjacent to $K$ through
a primary edge which has already been assigned to a processor.
If there is more than one, we choose a cluster $K_{fa}$ which
has the highest binding power (other than $K_{da}$) with $K$. After
assuming $K_{fa}$ as a direct ancestor of $K$, we reiterate the same
mapping procedure mentioned above (i.e., finding the best
mapping from processor $P_{K_{fa}}$). We call such a mapping
foster mapping. The only difference is that $K_{fa}$ is now
assumed to be the direct ancestor of $K$ in place of $K_{da}$ in the VAG.
If there does not exist such a primary edge, utilizing the
secondary edges, we repeat the same procedure as we do for
the primary edge.

Since the previous heuristics do not guarantee an 
optimal mapping, we try to further improve the result by 
applying restricted pairwise exchange: random pairwise 
exchange is allowed only among clusters to which specific 
codes have been assigned during mapping [17]. Note that we 
keep track of such codes based on how the clusters have been 
assigned during mapping. 


4.4.3. Heterogeneous Mapping 


Heterogeneous mapping is a mapping of computation 
graphs onto architecture graphs which represent heterogene- 
ous multiprocessors. For heterogeneous mappings, it is 
important to utilize resources with high performance as far as 
possible so that the total schedule length can be minimized 
and the workload balance can be achieved. We first need to 
distinguish resources with higher performance from those 
with lower performance. A Dominant Service Tree (DST) 
provides a limited amount of global information on resources 
in a heterogeneous multiprocessor lest our mapping algo- 
rithms become totally greedy based on local information. We 
can construct a DST by utilizing a maximal spanning tree 
algorithm. This may be considered as a transformation of a 
PAG into another PAG. In a sense, the transformation can 
be regarded as prescanning of architecture graphs prior to 
physical mapping. During the scanning, we collect informa- 
tion like which processors have more computing power and 
which communication channels have more bandwidth than 
others. 


After the transformation of a PAG into a DST, the 
scheduling problems for heterogeneous multiprocessor archi- 
tectures turn into the tree-to-tree mapping problems. The 
edges in the PAG are to be divided into two different types: 
the primary and secondary edges. Analogous to a DRT, the 
edges belonging to the DST are called the primary edges, 
while the edges belonging to the PAG but not to the DST are 
called the secondary edges. The main goal of heterogeneous 
mapping is to identify a mapping which maintains adjacency 
of the primary edges of the VAG with those of the PAG. 
When there are no primary edges available, however, we 
utilize secondary edges of the PAG during mapping. Specific 
scheduling constraints (e.g., available local memory size) are 
also to be applied on the fly during the mapping. 
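
As a rough illustration of the DST construction, the sketch below builds a maximum spanning tree with Kruskal's algorithm and then splits the PAG edges into primary (in the tree) and secondary (the rest). The text does not say which spanning-tree algorithm or edge weights are used; taking channel bandwidth as the weight, and the 4-processor graph itself, are assumptions made only for this example.

```python
def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm over edges sorted by descending weight.

    edges: list of (weight, u, v) with nodes numbered 0..num_nodes-1.
    Returns the list of edges kept in the tree.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical PAG: weights stand for channel bandwidth (an assumption).
pag_edges = [(10, 0, 1), (4, 1, 2), (8, 2, 3), (6, 0, 3), (2, 0, 2)]
primary_edges = maximum_spanning_tree(4, pag_edges)
secondary_edges = [e for e in pag_edges if e not in primary_edges]
print("primary:", primary_edges)
print("secondary:", secondary_edges)
```

With these weights the three highest-bandwidth channels that do not close a cycle become primary edges, and the two remaining channels are secondary edges.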


Since finding an optimal mapping from one tree to another is 
still an NP-complete problem, the issue is how to develop 
efficient heuristic mapping algorithms between a DRT and a 
DST. We exploit the sequential mapping order of nodes 
determined during construction of the DRT, together with 
so-called node information [17], as a means to avoid 
exhaustive matching between the two trees. 


5. APPLICATIONS 


The applications described here cover regular (Sieve of 
Eratosthenes, Gaussian elimination) and irregular (molecular 
physics code) computation graphs, and partitioned (Intel 
iPSC) and shared memory (Sequent Balance) multiprocessor 
architectures. 


5.1 Mapping of Sieve of Eratosthenes to an Intel iPSC 


Here we seek to decrease the communication time component 
of the total execution time. The computation graph for the 
algorithm is shown in Fig. 5-1. The VAG for the computation 
graph is shown in Fig. 5-2. Fig. 5-3 shows the improvement 
in total execution time obtained by application of the 
algorithm, together with the lower bound of total execution 
time for this execution environment. 


Figure 5-1 Computation Graph 


Figure 5-2 Virtual Architecture Graph 

Figure 5-3 Comparison of Total Execution Times 


5.2. Gaussian Elimination on a Sequent Balance 


The principal benefit to be obtained from application of 
our algorithm to scheduling for a shared memory 
multiprocessor is a decrease in overhead, without loss of 
parallelism, resulting from an optimal selection of 
schedulable units of computation. The computation graph for 
forward elimination is shown in Fig. 5-4. Each node 
$A_{i,j}$ in Fig. 5-4 represents the row operation that 
forces $A_{i,j}$ to zero. The VAG is shown in Fig. 5-5. The 
saving in overhead is shown in Fig. 5-6 for 9 processors 
across a range of array sizes after linear clustering and 
merging. The gain is substantial (15%-20%) for larger array 
sizes. 


Figure 5-4 Computation Graph 


5.3. Modified Molecular Dynamics Code on a "Heterogeneous" 
Intel iPSC 


The effects to be studied here are those of an irregular 
computation graph on a heterogeneous architecture. The 
computation graph is shown in Fig. 5-7 and the VAG in 
Fig. 5-8. In order to obtain the effect of a heterogeneous 
multiprocessor, we assume that 50% of the processors and 20% 
of the communication channels are twice as fast as the real 
ones, by setting computation times of nodes and 
communication times of edges in the VAG to 1/2 of their 
actual values if they are assigned to faster processors or 
channels, respectively. Fig. 5-9 compares total execution 
times for four cases (x, y), where x = (homogeneous, 
heterogeneous) and y = (measured, theoretical). It is not 
surprising that the improvement in execution time is greater 
for the heterogeneous architecture than for the homogeneous 
architecture, since the clusters with greater resource 
requirements can be assigned to the faster processors and 
channels of the heterogeneous one as far as possible. 


Figure 5-9 Comparison of Total Execution Times (Unclustered vs Clustered) 


6. SUMMARY 


The conceptually simple and computationally tractable 
heuristics based on linear clustering have been found in appli- 
cation to be effective and, so far as can be judged by the lim- 
ited sample of applications, robust. 

Future work will include tests on a larger number of 
applications, incorporation of various scheduling constraints 
into the model, and analytic definition of the class of 
graphs for which the heuristics yield optimal schedules. 

Abstract 


We develop efficient parallel algorithms for 
several river routing problems. These algorithms 
can be implemented on the CREW-PRAM model 
in $O(\log n)$ or $O(\log^2 n)$ time with $O(n)$ 
processors, where $n$ is the size of the input. Our 
algorithms have fast implementations on other 
parallel models such as the mesh or the hypercube. 


1 Introduction 


It is well-known that many of the optimization problems 
arising in VLSI routing are NP-complete (e.g. [KL],[L], 
[SB],[S]). One notable exception is the class of river 
routing problems associated with a hierarchical layout 
strategy such as Bristle-Blocks ([J]). See ([CS],[D et al], 
[LM],[LP],[M],[P],[SD],[T]) for more examples. In this 
paper, fast parallel algorithms for several river routing 
problems are presented. In particular, $O(\log n)$ or 
$O(\log^2 n)$ time algorithms with $O(n)$ processors are 
developed for the separation problem and for the routability 
problem around a rectilinear polygon ([P]). 

The above problems are considered in the CREW-PRAM 
model, which is characterized by the presence of an 
unlimited number of processors which can access a shared 
memory unit. Concurrent read is allowed while concurrent 
write is not. We are aiming for efficient parallel 
algorithms that run in $O(T(n)/p)$ time, where $p$ is the 
number of processors and $T(n)$ is the running time of the 
best known sequential algorithm with input length $n$. In 
the rest of the paper, we assume that the reader is familiar 
with some of the basic parallel techniques such as path 
doubling, parallel prefix, and the Euler tour technique. 
Our algorithms can be mapped into fixed-interconnection 
parallel architectures such as the array architecture or the 
hypercube. For example, all the algorithms stated in this 
paper can be implemented on a $\sqrt{n} \times \sqrt{n}$ 
mesh in time $O(\sqrt{n})$, where $n$ is the input length. 

The class of general river routing problems involves 
routing between ordered sequences of terminals such that 
the final layout is planar. One such problem is the wiring 
of two ordered sets of terminals $\{b_0, b_1, \ldots, b_{n-1}\}$ 
and $\{t_0, t_1, \ldots, t_{n-1}\}$ across a channel between 
the parallel boundaries of two rectangles. The width of the 
channel is the vertical distance between the two lines 
forming the channel. The separation problem is to find the 
minimum width of the channel necessary to wire all nets such 
that any two wires are separated by a unit distance. We 
will restrict ourselves to the case where the wires are 
rectilinear, i.e., there is a grid structure such that each 
wire consists of a set of grid line segments. Our methods 
generalize to all the other known variations ([SD],[T]). 

A more general version of the river routing problem 
that is known to have an efficient serial algorithm is to 
perform planar routing where the ports lie on the boundary 
of a simple rectilinear polygon. In this case, we are 
interested in whether the routing is possible or not and, 
if it is possible, we have to provide the detailed routing. 
Several interesting subproblems such as finding the 
contour of the union of a set of rectilinear polygons or 
determining whether a set of nets can be wired within a 
set of “passages” are also tackled. 


2 The Separation Problem 


Let $\{N_i = \langle b_i, t_i \rangle \mid 1 \le i \le n\}$ be 
an instance of the channel separation problem. Notice that 
$b_i$ and $t_i$ will also be used to denote the horizontal 
coordinates of the terminals relative to an arbitrary 
origin. A net $N_i$ is a right net if $b_i < t_i$. If 
$b_i > t_i$, then $N_i$ is a left net. Otherwise, it is a 
vertical net. We can partition the nets into right blocks, 
left blocks and vertical blocks. A set of right nets 
$N_i, N_{i+1}, \ldots, N_p$ is a right block if it is a 
maximal block with the property $b_k < b_{k+1} < t_k$ for 
any $i \le k < p$. We can similarly define left blocks and 
vertical blocks. 

The wiring problem is reduced to wiring each block 
separately. We will concentrate on the wiring of right 
blocks. Obvious changes can be made to deduce the cor- 
responding algorithm for left blocks. 


The wiring of a net can be specified by the coordinates 
of its bend points. For example, net $N_1$ of Figure 1 has 
the bend points $A_{11}$, $B_{11}$. For each net $N_i$, we 
have $2k$ bend points, $A_{i1}, A_{i2}, \ldots, A_{ik}$ and 
$B_{i1}, B_{i2}, \ldots, B_{ik}$, for some $k$. Not all of 
these bend points are needed to determine the overall 
wiring. Let us call $A_{i1}$ and $B_{i1}$ (the bend points 
closest to the bottom row) the characteristic bend points 
and all the others ordinary bend points. Notice that the 
characteristic bend points uniquely define the overall 
wiring, since once we have the wiring of $N_{i-1}$ and the 
characteristic bend points $A_{i1}$ and $B_{i1}$, we can 
determine all the ordinary bend points of $N_i$ very easily. 
Figure 1 shows an example of a river routing problem 
and a wiring achieving the minimum separation. 


Figure 1: Basic river routing problem 


The algorithm to find the minimum separation is based 
on the following lemma. 


Lemma 1 Let $N_i$ be a net in a right block and let $j$ be 
the minimum $j \le i$ such that $t_j + (i - j - 1) \ge b_i$. 
Then the coordinates of the characteristic bend points of 
$N_i$ are 


$A_{i1} = (b_i, i - j + 1)$ and $B_{i1} = (t_j + i - j, i - j + 1)$. 
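
To make Lemma 1 concrete, here is a small sequential sketch that evaluates it directly: for each net $N_i$ it scans for the smallest $j \le i$ with $t_j + (i - j - 1) \ge b_i$ and returns the two characteristic bend points. It is a brute-force check, not the parallel method of the paper. The function name, the list-based input and the assumption of integer terminal coordinates (so that $j = i$ always qualifies for a right net) are choices made for this example only.

```python
def characteristic_bend_points(b, t):
    """Evaluate Lemma 1 by brute force for one right block.

    b, t: bottom and top terminal coordinates, nets listed left to right.
    Indices in the returned tuples are 1-based, as in the lemma.
    Returns a list of (j, A_i1, B_i1), one entry per net.
    """
    n = len(b)
    result = []
    for i in range(1, n + 1):
        # Smallest j <= i with t_j + (i - j - 1) >= b_i.
        j = next(
            j for j in range(1, i + 1)
            if t[j - 1] + (i - j - 1) >= b[i - 1]
        )
        height = i - j + 1
        a_point = (b[i - 1], height)
        b_point = (t[j - 1] + i - j, height)
        result.append((j, a_point, b_point))
    return result


# Hypothetical block of three right nets.
bends = characteristic_bend_points([1, 2, 3], [4, 5, 6])
for net, (j, a_point, b_point) in enumerate(bends, start=1):
    print(f"N{net}: j={j}, A={a_point}, B={b_point}")
```

In this example every net is pushed up by all of its left neighbours, so the three nets run at heights 1, 2 and 3.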


We now show how to compute in parallel the index 
$j(i)$ for each $i$. 

Algorithm Index 

input: A set of nets $\langle b_i, t_i \rangle$, $1 \le i \le n$, 
forming a right block. 

output: $j(i)$ such that $j(i)$ is the minimum $j$ such that 
$b_i - t_j \le i - j - 1$, for each $1 \le i \le n$. 

1. Compute $b'_i = b_i - i$ and $t'_j = t_j - j - 1$ for each 
$i$ and $j$. 

2. Sort the $t'_j$'s, say $t'_{p_1} \le t'_{p_2} \le \ldots \le t'_{p_n}$. 

3. For each $p_i$, determine $f(p_i) = \min\{p_k \mid i \le k \le n\}$. 

4. Sort the $b'_i$'s and the $t'_j$'s such that if 
$b'_i = t'_k$, the $b'_i$ is pushed to the lower rank. 

5. For each $b'_i$, let $t'_{p_r}$ be the closest 
$t'_k \ge b'_i$. Then $j(i) = f(p_r)$. 
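
The steps of Algorithm Index can be mimicked sequentially, which may help in checking the index arithmetic. In the sketch below an ordinary sort stands in for the parallel sort, a backward scan stands in for the parallel prefix (suffix minimum) of step 3, and a binary search replaces the merged sort of steps 4 and 5. The function names and the integer-coordinate assumption are illustrative only.

```python
from bisect import bisect_left


def index_by_sorting(b, t):
    """Sequential emulation of Algorithm Index; returns [j(1), ..., j(n)]."""
    n = len(b)
    # Step 1: shifted keys b'_i = b_i - i and t'_j = t_j - j - 1.
    b_shift = [b[i - 1] - i for i in range(1, n + 1)]
    t_shift = sorted((t[j - 1] - j - 1, j) for j in range(1, n + 1))  # step 2
    keys = [value for value, _ in t_shift]

    # Step 3: suffix minimum of the original indices in sorted order.
    suffix_min = [0] * n
    running = n + 1
    for pos in range(n - 1, -1, -1):
        running = min(running, t_shift[pos][1])
        suffix_min[pos] = running

    # Steps 4-5: first sorted position whose t' is >= b'_i (ties count).
    return [suffix_min[bisect_left(keys, value)] for value in b_shift]


def index_brute_force(b, t):
    """Direct evaluation of the definition, used only as a cross-check."""
    n = len(b)
    return [
        min(j for j in range(1, n + 1) if b[i - 1] - t[j - 1] <= i - j - 1)
        for i in range(1, n + 1)
    ]


b_terms, t_terms = [1, 3, 4, 9], [2, 5, 6, 12]   # hypothetical right nets
print(index_by_sorting(b_terms, t_terms))
print(index_brute_force(b_terms, t_terms))
```

Both functions return the same indices on this input; the sorting version does so with one sort, one scan and one binary search per net.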


Now we can find the minimum separation as well as 
the characteristic bend points of all the nets by 
partitioning the nets into blocks and by using algorithm 
Index and Lemma 1. 


Theorem 1 The minimum separation and the characteristic 
bend points of $n$ input nets can be found in $O(\log n)$ 
time with $O(n)$ processors on a CREW-PRAM. If all 
terminals lie in the range $[1, N]$, where $N = O(n)$, then 
the running time is $O(n/p + \log n)$ with $p$ processors, 
for all $1 \le p \le n^{1-\epsilon}$ (any $\epsilon > 0$). 


3 Routing In a Simple Polygon 


The routing problem of nets within a simple rectilinear 
polygon introduced in [P] is a generalization of the 
standard river routing problem. In this case we are supposed 
to connect a set of terminals $a_1, a_2, \ldots, a_n$ on the 
boundary of a simple rectilinear polygon to another set of 
terminals $b_1, b_2, \ldots, b_n$ on the boundary of the same 
polygon such that all the wires lie within the polygon and 
no two wires intersect. Routability testing is to determine 
whether or not a one layer routing is possible, and 
detailed routing is to specify the actual wiring of the $n$ 
nets, if they are routable. We will restrict ourselves to 
the rectangle case. However, all the algorithms can be 
generalized to any rectilinear polygon. 


Let $N_i = \langle a_i, b_i \rangle$ be an arbitrary net. The 
terminals $a_i$ and $b_i$ divide the boundary of a rectangle 
$R$ into two parts. The part of smaller length will be 
called the internal boundary of $N_i$. The other part will 
be called the external boundary. A net $N_i$ is covered by 
another net $N_j$ if the terminals of $N_j$ are in the 
external boundary of $N_i$ and the terminals of $N_i$ are in 
the internal boundary of $N_j$. A representative net is a 
net that is not covered by any other net. Figure 2 shows an 
example of a detailed routing problem such that $N_1$, $N_6$ 
and $N_{14}$ are the representative nets. We can partition 
the nets into groups such that each group consists of a 
representative net and all the nets covered by it. The 
groups in Figure 2 are $\{N_1, N_2, N_3, N_4, N_5\}$, 
$\{N_6, N_7, N_8, N_9, N_{10}, N_{11}, N_{12}, N_{13}\}$, and 
$\{N_{14}, N_{15}\}$. One can show the following. 

Figure 2: Basic river routing around a rectangle boundary 
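
The covering relation and the resulting groups can be tried out on small inputs with the brute-force sketch below. It assumes that the rectangle boundary has been unrolled into positions in $[0, \text{perimeter})$ and that each net is a pair of such positions. Only the "terminals of the inner net lie in the internal boundary of the outer net" half of the definition is tested, which is an assumption that suffices when the nets do not cross. All names are hypothetical, and the quadratic scan is not the parallel method of the paper.

```python
def in_internal_boundary(net, x, perimeter):
    """True if position x lies strictly inside the shorter arc of `net`."""
    lo, hi = sorted(net)
    if hi - lo <= perimeter / 2:
        return lo < x < hi            # shorter arc is the direct one
    return x < lo or x > hi           # shorter arc wraps around position 0


def is_covered(inner, outer, perimeter):
    """True if both terminals of `inner` lie in the internal boundary of `outer`."""
    return all(in_internal_boundary(outer, x, perimeter) for x in inner)


def group_nets(nets, perimeter):
    """Return {representative index: [indices of nets in its group]}."""
    count = len(nets)
    representatives = [
        i for i in range(count)
        if not any(j != i and is_covered(nets[i], nets[j], perimeter)
                   for j in range(count))
    ]
    groups = {r: [r] for r in representatives}
    for i in range(count):
        if i in groups:
            continue
        for r in representatives:
            if is_covered(nets[i], nets[r], perimeter):
                groups[r].append(i)
                break
    return groups


# Hypothetical instance on a boundary of length 20.
example_nets = [(1, 8), (2, 7), (3, 4), (10, 15), (11, 12)]
example_groups = group_nets(example_nets, 20)
print(example_groups)
```

Here nets 0 and 3 are representatives, and the remaining nets fall into the group of the representative whose internal boundary contains them.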


Lemma 2 Suppose a given instance of the above problem 
is routable. Then the routing can be performed by routing 
each group of nets separately. 


The general strategy for specifying the routing will 
be the following: (i) identify the proper groups, (ii) find 
the representative nets, and (iii) specify the routing of 
each group. By the parallel techniques of sorting, path 
doubling and prefix computation, we can create a chain of 
the nets involved in each group such that a representative 
net is a sink and the chains have the following properties: 


Lemma 3 Let $N_{r_1}, N_{r_2}, \ldots, N_{r_k}$ be all the 
representative nets and let $R(N_{r_i})$ be the number of 
nets in the internal boundary of $N_{r_i}$. Then 
$\sum_{i=1}^{k} (R(N_{r_i}) + 1) = n$. Moreover, there exists 
a wiring strategy such that $N_{r_i}$ has at most 
$2(R(N_{r_i}) + 1)$ bend points. 


Corollary: The total number of bend points of all the 
representative nets is O(n), where n is the number of 
nets. 


Lemma 4 Let $n$ be the number of nets. Then all the 
groups and representative nets can be identified in time 
$O(\log n)$ with $O(n)$ processors on the PRAM. With $p$ 
processors, we obtain $O(n/p + \log n)$, 
$1 \le p \le n^{1-\epsilon}$ and $\epsilon > 0$. 


We now turn to the problem of routing each group 
separately. Our goal here is to identify the bend points 
of each representative net. Let $N = \langle x, y \rangle$ be 
a net in a group whose representative is $N_r$. Let $k$ be 
the number of nets between $N$ and $N_r$, including both 
$N$ and $N_r$. The bounding perimeter of rank $k$ is the 
rectilinear boundary of the region determined by $N$ such 
that the wiring of $N_r$ cannot lie inside it, i.e., this is 
the boundary of the region within the rectangle of all the 
points of distance $< k$ of the rectangle boundary 
determined by $N$. Consider again the case of Figure 2. Let 
$B_{ik}$ be the bounding perimeter of rank $k$ induced by 
net $N_i$. Figure 3 shows 
the contours B33, Bs5, B22, Bo and By,. We claim 
that the following lemma is true. 


Lemma 5 The union of all the bounding perimeters of 
all the nets within a group determines the contour of the 
group and hence determines the wiring of the representa- 
tive net. 


To determine the union, flatten the rectangle into 
a line. Suppose a terminal $p$ gets mapped into $\bar{p}$. A 
bounding perimeter connecting $p$ and $q$ of rank $k$ will 
get mapped into a simple rectangle with endpoints $\bar{p}$ 
and $\bar{q}$ and height $k$. Denote the mapped bounding 
perimeters by $R_1, R_2, \ldots, R_n$. These rectangles 
determine a (union) contour $R$ given by its extreme points. 
Then map these extreme points back to the original rectangle 
to get the wiring of the representative net. A few of these 
points around the corners may not be mapped into extreme 
points of the contour within the rectangle, but rather 
onto the boundary. These can be determined quickly 
and then eliminated. We are now ready to state the 
algorithm. 

Figure 3: The union of all bounding perimeters 

Algorithm Contour 


Input: A group of nets with their representative. 
Output: The bendpoints of the corresponding contour. 


1. Determine the rank of each net, i.e. the number of 
nets between itself and the representative net. 

2. Determine all the bounding perimeters. 

3. Flatten the rectangle boundary into a line. Map the 
bounding perimeters into this line. Each corresponding 
perimeter can be identified by $(p, q, k)$. 

4. Sort the triplets $(p, q, k)$ according to $k$. For each 
$k$, determine the union of line segments at distance $k$. 

5. From each line segment generated at step 4, determine 
the corresponding bendpoints. The overall contour can 
be specified by the bend points. 

6. Map the bend points of the contour on the line back 
into the rectangle. Eliminate those points within the 
rectangle which are not bend points. 
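
Once the boundary has been flattened, steps 4 and 5 amount to finding the upper contour of a set of rectangles standing on a line. The sketch below computes that contour sequentially with a sweep line and a max-heap, which is a substitute technique chosen here for brevity; the paper's algorithm instead sorts by rank and merges segments level by level in parallel. The triplet layout $(p, q, k)$ follows step 3; the function name and the sample data are hypothetical.

```python
import heapq


def union_contour(rects):
    """Upper contour of rectangles (p, q, k) spanning [p, q] with height k.

    Returns a list of (x, height) points at which the contour height changes.
    """
    breakpoints = sorted({x for p, q, _ in rects for x in (p, q)})
    by_start = sorted(rects)
    active = []                      # max-heap on height, stored as (-k, q)
    contour = []
    nxt = 0
    for x in breakpoints:
        # Activate every rectangle that starts at or before x.
        while nxt < len(by_start) and by_start[nxt][0] <= x:
            _, q, k = by_start[nxt]
            heapq.heappush(active, (-k, q))
            nxt += 1
        # Discard the tallest rectangles that have already ended.
        while active and active[0][1] <= x:
            heapq.heappop(active)
        height = -active[0][0] if active else 0
        if not contour or contour[-1][1] != height:
            contour.append((x, height))
    return contour


# Hypothetical flattened bounding perimeters (p, q, rank).
flattened = [(0, 10, 1), (2, 5, 3), (4, 8, 2)]
contour_points = union_contour(flattened)
print(contour_points)
```

Each returned point marks a change in contour height, so consecutive points give the bend points of the flattened contour before they are mapped back onto the rectangle in step 6.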


Lemma 6 If the number of nets in the group is n, then 
algorithm contour can be implemented in time O(log n) 
with O(n) processors. 


Theorem 2 Detailed routing of the representative nets 
of $n$ nets within a simple rectilinear polygon can be done 
in time $O(\log n)$ with $O(n)$ processors. With $p$ 
processors, we have $O(n/p + \log n)$, 
$1 \le p \le n^{1-\epsilon}$ and $\epsilon > 0$. 


3.2 Routability Testing 


The problem may be unroutable for one of the following 
reasons: (1) The graph determined by the nets when 
restricted to lie within the rectangle is nonplanar. (2) 
The wiring of all the nets requires more area. 


Lemma 7 Whether the interconnection pattern of the 
given nets is planar can be determined in $O(\log n)$ 
time with $O(n)$ processors on the PRAM model. 


A single side net is a net whose terminals lie on the 
same side of the rectangle. If the terminals lie on adjacent 
sides, then the net is called a corner net. It is a cross 
net if the terminals lie on opposite sides. Partition the 
single side nets corresponding to each specific side into 
single side blocks such that each net except one (the cover 
net) is covered by one or more nets in the block. Moreover, 
each such block is maximal. A corner block is a maximal set 
of corner nets corresponding to the same corner such that 
each net except one (the cover net) is covered by one or 
more nets within the block. Moreover, no other net outside 
a block is covered by the cover net. For example, in 
Figure 2, N2 is a single side net, N; is a corner net and 
$N_6$ is a cross net. The single side blocks are 
$\{N_2, N_3\}$, $\{N_7\}$, $\{N_{11}, N_{12}\}$ and 
$\{N_{14}, N_{15}\}$, whose corresponding cover nets are 
$N_2$, $N_7$, $N_{11}$ and $N_{14}$. $\{N_4, N_5\}$, 
$\{N_9, N_{10}\}$ and $\{N_{13}\}$ are the corner blocks 
with $N_4$, $N_9$, $N_{13}$ as the corresponding cover nets. 

To decide whether the above blocks are routable, first 
determine the wiring of all the cover nets by algorithm 
Contour then check whether there is any intersection be- 
tween the wires of the cover nets. 


Lemma 8 Whether or not the single side blocks and the 
corner blocks can be wired within the rectangle can be 
determined in O(log n) time with O(n) processors. 


Once the block cover nets are wired, it should be checked 
whether there is enough space to route the remaining nets. 
Our approach consists of determining the wiring capacity and 
the wiring density between blocks. The wiring capacity 
between two blocks is the number of nets that can be wired 
between these two blocks, while the wiring density is the 
number of wires that have to be wired between these two 
blocks. The capacity between blocks on two orthogonal sides 
of the rectangle boundary is computed as follows. Given a 
block B, consider all the convex corners 
of B. Generate 45 degree “rays” from each such corner 
and determine the line segment where it intersects an- 
other block contour or the original rectangle boundary. 
Based on this information, one can determine the width 
of the narrowest passage between B and any other block. 
The details are given in the full paper. 


Algorithm Intersection 


Input: Contours of single side and corner blocks on two 
orthogonal sides of rectangle boundary. 

Output: Intersection points of rays emanating from con- 
vex corners. 

1. Consider the case of the lower right corner. The 
other cases can be dealt with in a similar fashion. Sort 
all the line segments determined by the block contours 
and the right side of the rectangle $R$. Determine the 
projection of each line segment on the diagonal, say line 
segment $i$ is projected into line segment $p(i)$ on the 
diagonal. 
2. Sort the projections according to their order on the 
diagonal and compute $p'(i) = p(i) - \bigcup_{j<i} p(j)$. 
3. For each ray $y$ coming out of a corner of a contour 
on the horizontal side of the original rectangle, find its 
intersection with the diagonal. If the intersection point 
lies in $p'(i)$, then ray $y$ intersects segment $i$. 
Determine the intersection point of ray $y$ and line 
segment $i$. 
4. If a ray y intersects the original rectangle boundary, 
then rotate to find the intersection with the next line 
segment belonging to some block contour (see Figure 4 
and ray yp ). Now determine the point of intersection. 
For example, one can check that in Figure 4 p'(CD) = 
C'D' and p'(EF) = D'F"’. Hence rays y, and yg intersect 
CD and EF respectively. If we rotate yg, we can find 
the intersection with the next line segment GF. 


Figure 4: Intersection between rays and block contours 


Lemma 9 Algorithm Intersection finds the intersection 
points of rays emanating from convex corners with the 
line segments of contours on two orthogonal sides of the 
bounding rectangle in $O(\log n)$ time with $O(n)$ 
processors. 


Use algorithm Intersection to compute the intersec- 
tion point of each ray with a single side contour, corner 
block contour or the original boundary of the rectangle. 
The capacity between blocks can then be calculated eas- 
ily. Then compare with the density between blocks to 
determine the routability between blocks. 


Lemma 10 Testing the routability of $n$ nets between two 
orthogonal sides of a rectangle can be done in $O(\log n)$ 
time with $O(n)$ processors on the CREW PRAM model. 


We now address the routability problem between two 
opposite sides of the bounding rectangle. The generation 
of horizontal, vertical and 45 degree rays from each 
convex corner is not enough to determine the routability 
between two opposite sides. We will use a divide-and-conquer 
strategy to handle this case. 

Assume without loss of generality that all cross nets 
are between the top and the bottom sides. Select two 
adjacent cross nets $N_i$ and $N_j$ that split the nets 
almost evenly. Let $N_i$ be to the left of $N_j$ (Figure 5). 
Find the temporary wiring of $N_i$ as close to the left as 
possible and the temporary wiring of $N_j$ as close to the 
right as possible. Check whether any intersection will 
result. Repeat the above procedure recursively for the cross 
nets to the left of $N_i$ and for the cross nets to the 
right of $N_j$ separately. 


Figure 5: Routability between two blocks in opposite 
sides 


Theorem 3 Testing the routability of $n$ nets within a 
simple rectilinear polygon can be done in $O(\log^2 n)$ time 
with $O(n)$ processors on the CREW PRAM model. 

Abstract 

Consider a fully connected network of $n \ge 3$ 
processes in which a process can send messages to a 
set of other processes simultaneously. Messages sent 
from a process to other processes simultaneously at time 
$t$ are guaranteed to be delivered in the time interval 
$[t+\delta, t+\delta+\epsilon]$ for some $\delta$ and 
$\epsilon$, where $\epsilon$ is a constant but $\delta$ can 
vary and no upper bound on $\delta$ is known. We 
show that, under this assumption, the clocks of the n 
processes cannot be synchronized any more closely than 
(1 + ay even if the clocks run at the rate of real 
time. A simple algorithm that synchronizes the clocks to 
within $(1+\frac{1}{n})\epsilon$ is presented. The 
$(1+\frac{1}{n})\epsilon$ upper bound on the imprecision of 
clock synchronization, together with the 
$(1-\frac{1}{n})\epsilon$ lower bound found in the literature 
for the case in which both $\delta$ and $\epsilon$ are known 
constants, implies 
that whether or not there exists a given upper bound on 
the message transmission time becomes less and less sig- 
nificant when the number of processes increases. This is 
the first known solution for clock synchronization under 
unbounded message transmission time. 


1 Introduction 


The problem of synchronizing clocks in a distributed 
system has been investigated under various assumptions. 
For example, in [7], Lundelius and Lynch considered the 
problem in an error-free system of $n$ processes in which 
there is an uncertainty of $\epsilon$ in the message 
delivery time. That is, for some known constants $\delta$ 
and $\epsilon$, a message sent by a process at time $t$ is 
guaranteed to be delivered at the destination within the 
time interval $[t+\delta, t+\delta+\epsilon]$. They show 
that, under this assumption, it is impossible to synchronize 
the clocks of $n$ processes any more closely than 
$(1-\frac{1}{n})\epsilon$, even if all clocks run at the 
rate of real time. They also present an algorithm that 
achieves this bound. Clock synchronization when processes 
and communication links can fail has been studied 
extensively in [1] [2] [3] [5] [6] [8]. 

All clock synchronization algorithms reported in the 
literature [1]-[8] have been obtained under the assump- 
tion that an upper bound on the message transmission 
time is given. Although this may be a reasonable as- 
sumption in many practical situations, achieving clock 
synchronization when no upper bound on the transmis- 
sion time is available is interesting, not only from the 
theoretical point of view. 

In this paper we consider the problem of clock 
synchronization in a fully connected, error-free network of 
$n \ge 3$ processes in which a process can send messages 
to any set of processes simultaneously. We assume that 
if a process $P$ sends messages to a set $S$ of processes 
simultaneously at time $t$, then 


1. the messages addressed to the processes $P' \in S$ 
such that $P' \ne P$ are received within the time 
interval $[t+\delta, t+\delta+\epsilon]$ for some finite 
$\delta > 0$ and $\epsilon > 0$, where $\epsilon$ is a 
constant but $\delta$ can vary and no upper bound on 
$\delta$ is known, and 


2. if $P \in S$, then the transmission time of the 
message from $P$ to $P$ itself may not be related to those 
of the messages addressed to the processes $P' \ne P$. 


That is, messages sent by a process to other processes 
simultaneously are delivered within a time interval of 
size $\epsilon$, but the message transmission times can be 
unbounded. 

If messages sent by a process $P$ simultaneously to 
a set $S$ of processes such that $P \in S$ are all delivered 
within a time interval of size $\epsilon$, then clock 
synchronization becomes a trivial problem. It is 
conceivable, however, that in certain systems messages sent 
by a process to itself are processed locally and delivered 
immediately, whereas messages sent to other processes are 
delivered more or less simultaneously when the communication 
channel becomes available after an unpredictable delay.¹ 
The model we consider can be a close approximation of 
such a system. 


It should be easy to see, at least intuitively, that 
synchronizing clocks without using an upper bound on the 
message transmission time is more involved compared 
to the case in which an upper bound is known. For example, 
in the algorithm of [7], a process which receives a 
message assumes that the transmission time of the message 
was exactly $\delta + \epsilon/2$, the average of the lower 
and upper bounds. If no upper bound is given, such a simple 
approximation is not possible. 

We show that, under the assumption. described 
above, the clocks of n processes cannot be synchronized 
any more closely than (1+ a CESV Ok for any n > 3, even 
if the clocks run at the rate of real time. The proof is by 
the “many scenarios” techniques [1] [7] used commonly 
for this purpose. Next, we present a simple algorithm 
that synchronizes the clocks of $n$ processes to within 
$(1+\frac{1}{n})\epsilon$ for any $n \ge 3$. The algorithm 
achieves optimal clock synchronization for $n = 3$ and is 
nearly optimal for $n \ge 4$. 

An interesting observation is in order. The 
$(1-\frac{1}{n})\epsilon$ lower bound on the imprecision of 
clock synchronization proved in [7], for the case in which 
the message transmission time is in the range 
$[\delta, \delta+\epsilon]$ for known constants $\delta$ and 
$\epsilon$, increases and approaches $\epsilon$ when $n$ 
becomes larger. In contrast, the $(1+\frac{1}{n})\epsilon$ 
upper bound obtained under the assumption of this paper 
decreases and approaches $\epsilon$ when $n$ becomes larger. 
This implies that whether or not there exists a given upper 
bound on the message transmission time becomes less and less 
significant when the number of processes increases. 
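
The convergence of the two bounds is easy to tabulate. The short sketch below evaluates the lower bound $(1-\frac{1}{n})\epsilon$ for the known-delay model and the upper bound $(1+\frac{1}{n})\epsilon$ for the unbounded-delay model, reading both expressions as stated in the text; the function names and the use of exact fractions are choices made for this illustration only.

```python
from fractions import Fraction


def lower_bound_known_delay(n, eps):
    """(1 - 1/n) * eps: lower bound when the delay range is known."""
    return (1 - Fraction(1, n)) * eps


def upper_bound_unbounded_delay(n, eps):
    """(1 + 1/n) * eps: precision reached when no delay bound is known."""
    return (1 + Fraction(1, n)) * eps


def gap(n, eps=Fraction(1)):
    """Difference between the two bounds for n processes."""
    return upper_bound_unbounded_delay(n, eps) - lower_bound_known_delay(n, eps)


for n in (3, 10, 100, 1000):
    print(n, float(lower_bound_known_delay(n, 1)),
          float(upper_bound_unbounded_delay(n, 1)), float(gap(n)))
```

The difference between the two expressions is $2\epsilon/n$, which shrinks toward zero as the number of processes grows.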


2 The Model 


Let $P_1, P_2, \ldots, P_n$ be $n \ge 3$ processes. We assume 
that messages sent by a process to other processes 
simultaneously at real time $t$ are received in the time 
interval $[t+\delta, t+\delta+\epsilon]$ for some finite 
$\delta > 0$ and $\epsilon > 0$, where $\epsilon$ is known 
but $\delta$ can vary and no upper bound on $\delta$ is 
known. Other than this, the model we use is essentially 
that of [2] [7]. 


Process $P_i$ has a physical clock $C_i$ which is a 
real-valued function of real time. We assume that the 
physical clocks run at the rate of real time and they cannot 
be reset by the processes; that is, $C_i(t) = C_i(0) + t$ at 
every real time $t \ge 0$. The processes have no access to 
the real time. 


¹ For example, in a local area network consisting of sites 
running UNIX connected by Ethernet, messages sent from a 
process to itself are routed through a local “loopback” 
interface whose delay is independent of the load of the 
Ethernet. 

Destination    Transmission Time 


P, d+e 

P, d+(1——-5)e 
Pia gra ie 
Pris t die) 
Pia aU = Je 
Poy d+(1— *)e 


Py, d 


Table 1: Message transmission times from $P_i$ to other 
processes in $e_1$. 


Process $P_i$ has a local variable $A_i$ (for adjustment) 
which provides the difference between the logical and 
physical clock times of $P_i$. That is, the logical time 
$L_i(t)$ of process $P_i$ at real time $t$ is given by 
$L_i(t) = C_i(t) + A_i(t)$, where $A_i(t)$ is the value of 
$A_i$ at $t$. 

Following [2], we assume that a clock synchronization
algorithm is a deterministic algorithm in which the
state transition and the action of sending messages of
process $P_i$ at real time $t$ are determined only by the value
of $C_i(t)$ and the message history of $P_i$ at $t$. Here, the
message history of $P_i$ at real time $t$ is the sequence consisting
of tuples of the form $\langle P_j, m, T, y \rangle$ for every
message $P_i$ has sent or received before $t$, where
$\langle P_j, m, T, y \rangle$ represents that message $m$ was either
sent ($y$ = sent) or received ($y$ = received) to or from
$P_j$ when the value of $C_i$ was $T$. An algorithm is said
to synchronize the logical times to within $\gamma$ if the algorithm
eventually terminates, and when it terminates at
real time $t$, $|L_i(t) - L_j(t)| \le \gamma$ holds for any $i \ne j$.


3 Lower Bound 


In this section, we show that no algorithm can synchronize
the logical times of $n$ processes any more closely
than $(1 + \frac{1}{n(n-2)})\epsilon$ in our model. The proof is by the
standard "many scenarios" technique [1] [7].


Theorem 1 No algorithm can synchronize the logical
times of $n$ processes to within $\gamma$, for any $\gamma < (1 + \frac{1}{n(n-2)})\epsilon$.


Proof (Sketch) Fix an algorithm that synchronizes the
logical times to within $\gamma$. Let $e_1$ be an execution of the
algorithm in which the transmission times of messages
from $P_i$ to other processes are as given in Table 1, where
$d > (1 + \frac{1}{n-2})\epsilon$ is a constant. The transmission time of
a message from $P_i$ to $P_i$ itself is an arbitrary constant.
Since in $e_1$ messages sent by a process to other processes
at real time $t$ are received within the time interval
$[t + d, t + d + \epsilon]$ of size $\epsilon$, $e_1$ is a valid execution.


Consider another execution $e_2$ which is obtained from
$e_1$ by "shifting" [7] $P_1$ by $(1 + \frac{1}{n-2})\epsilon$. That is, $e_2$ is identical
to $e_1$ except that

1. at any given real time, the physical clock reading
of $P_1$ in $e_2$ is larger than that in $e_1$ by $(1 + \frac{1}{n-2})\epsilon$,

2. the transmission time of a message from $P_1$ to $P_j$
($j \ne 1$) is increased by $(1 + \frac{1}{n-2})\epsilon$,

3. the transmission time of a message from $P_j$ ($j \ne 1$)
to $P_1$ is decreased by $(1 + \frac{1}{n-2})\epsilon$, and

4. all state transitions and actions of sending messages
of $P_1$ take place earlier in $e_2$ than in $e_1$ by
$(1 + \frac{1}{n-2})\epsilon$ in real time.

The execution $e_2$ is valid, since all messages sent by a
process to other processes are received within a time
interval of size $\epsilon$. Similarly, for $2 \le i \le n$, we can
obtain a valid execution $e_i$ from $e_{i-1}$ by shifting $P_{i-1}$ by
$(1 + \frac{1}{n-2})\epsilon$.

Now assume that in $e_1$, the logical times of $P_1, P_2, \ldots, P_n$
are $T_1, T_2, \ldots, T_n$, respectively, at real time $t_f$
when the algorithm has terminated at every process.
By assumption we have

$$T_n \le T_1 + \gamma.$$

Since during the execution of the algorithm each process
has the same message history in $e_1$ and $e_2$ when its physical
clock has the same value, the values of $A_i$ computed
by the algorithm are the same in $e_1$ and $e_2$. Thus the
logical times of $P_1$ and $P_2$ at $t_f$ in $e_2$ are $T_1 + (1 + \frac{1}{n-2})\epsilon$
and $T_2$, respectively. Then by assumption we have

$$T_1 + \left(1 + \frac{1}{n-2}\right)\epsilon \le T_2 + \gamma.$$

Similarly, for $2 \le i \le n$, the logical times of $P_{i-1}$ and $P_i$
at $t_f$ in $e_i$ are $T_{i-1} + (1 + \frac{1}{n-2})\epsilon$ and $T_i$, respectively, and
thus by assumption we have

$$T_{i-1} + \left(1 + \frac{1}{n-2}\right)\epsilon \le T_i + \gamma.$$

By adding the $n$ inequalities we obtain

$$\gamma \ge \left(1 + \frac{1}{n(n-2)}\right)\epsilon.$$

□
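The final summation in the proof can be spelled out as follows. This is only a worked restatement of the step "adding the $n$ inequalities"; the symbol $s$ is shorthand introduced here (it is not in the original) for the shift amount $(1 + \frac{1}{n-2})\epsilon$.

```latex
\begin{align*}
T_n &\le T_1 + \gamma, \\
T_{i-1} + s &\le T_i + \gamma \qquad (2 \le i \le n), \\
\sum_{i=1}^{n} T_i + (n-1)\,s &\le \sum_{i=1}^{n} T_i + n\gamma, \\
\gamma &\ge \frac{n-1}{n}\,s
        = \frac{(n-1)^2}{n(n-2)}\,\epsilon
        = \Bigl(1 + \frac{1}{n(n-2)}\Bigr)\epsilon .
\end{align*}
```

The last equality uses $(n-1)^2 = n(n-2) + 1$.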


4 A Simple Algorithm 


There exists a simple algorithm which synchronizes the
logical times of $n$ processes to within $(1 + \frac{1}{n})\epsilon$ for any
$n \ge 3$.

The concept of "view" introduced below is essential
in describing the algorithm. Suppose that $P_i$ sends
the message SNAPSHOT$_i$ to $P_1, \ldots, P_{i-1}, P_{i+1}, \ldots, P_n$
simultaneously, and let $v_j$ ($j \ne i$) be the value of the
physical clock $C_j$ at the moment SNAPSHOT$_i$ is received
by $P_j$. Then the $n$-tuple

$$V = (v_1, v_2, \ldots, v_{i-1}, -, v_{i+1}, \ldots, v_n)$$

is called a view of $P_i$, where '$-$' represents "don't care."
Since messages sent simultaneously by a process to other
processes are received within a time interval of size $\epsilon$, the
following lemma is immediate.


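Lemma 1 Let $V = (v_1, v_2, \ldots, v_{i-1}, -, v_{i+1}, \ldots, v_n)$ be
a view of $P_i$. There exist some real time $t$ and $\alpha_1, \alpha_2, \ldots,
\alpha_{i-1}, \alpha_{i+1}, \ldots, \alpha_n$ such that for each $j \ne i$, $0 \le \alpha_j \le \epsilon$
and $v_j = C_j(t) + \alpha_j$.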


The algorithm can be divided into the following two 
phases. It is a straightforward exercise to represent the 
algorithm in any given language in such a way that its 
execution will eventually terminate at every process. 


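Phase 1 Obtain a view
$V_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,i-1}, -, v_{i,i+1}, \ldots, v_{i,n})$
of $P_i$ for each $1 \le i \le n$.


Phase 2 Compute $A_1, A_2, \ldots, A_n$ from $V_1, V_2, \ldots, V_n$ as
follows. For $1 \le i \le n$,

$$A_i = \frac{1}{n} \sum_{1 \le k \le n} D_{k,i},$$

where for $1 \le k, i \le n$,

$$D_{k,i} = \begin{cases} \dfrac{1}{n-2} \displaystyle\sum_{1 \le l \le n,\ l \ne k,i} (v_{l,k} - v_{l,i}) & \text{if } k \ne i, \\ 0 & \text{if } k = i. \end{cases}$$


For $k \ne i$, $D_{k,i}$ is the average of the differences between
the physical clock readings of $P_k$ and $P_i$ observed
in views $V_l$ such that $l \ne k, i$. $A_i$ is the average of $D_{k,i}$
over all $k$, including $D_{i,i} = 0$.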
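Phase 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `compute_adjustments` and `simulate_skew`, the list-of-lists representation of the views, and the random delay model in the simulation are all assumptions made for the example. The simulation draws an unknown base delay for each broadcast plus a per-receiver jitter in $[0, \epsilon]$, which matches the model of Section 2.

```python
import random

def compute_adjustments(views):
    """Phase 2: views[l][j] is the reading of C_j when SNAPSHOT_l reached P_j.

    views[l][l] is never read (it is the "don't care" entry).
    Returns the list of adjustments A_0, ..., A_{n-1} (0-based indices).
    """
    n = len(views)
    adjustments = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            if k == i:
                continue  # D_{i,i} = 0 contributes nothing
            # D_{k,i}: average of (v_{l,k} - v_{l,i}) over views V_l with l != k, i
            diffs = [views[l][k] - views[l][i] for l in range(n) if l not in (k, i)]
            total += sum(diffs) / (n - 2)
        adjustments.append(total / n)  # average over all k, including D_{i,i} = 0
    return adjustments

def simulate_skew(n, eps, seed=0):
    """Hypothetical test harness: returns max_i L_i - min_i L_i after one run."""
    rng = random.Random(seed)
    offsets = [rng.uniform(-100.0, 100.0) for _ in range(n)]  # C_j(0)
    views = [[None] * n for _ in range(n)]
    for i in range(n):
        send_time = rng.uniform(0.0, 10.0)
        delta = rng.uniform(0.0, 1000.0)  # base delay, unknown to the processes
        for j in range(n):
            if j != i:
                arrival = send_time + delta + rng.uniform(0.0, eps)
                views[i][j] = offsets[j] + arrival  # C_j(arrival)
    adj = compute_adjustments(views)
    # L_j(t) - t is the same at every t, so compare C_j(0) + A_j.
    logical = [offsets[j] + adj[j] for j in range(n)]
    return max(logical) - min(logical)
```

Under these assumptions the observed skew should not exceed $(1 + \frac{1}{n})\epsilon$, up to floating-point rounding, whatever base delays are drawn.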

By Lemma 1, for each view

$$V_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,i-1}, -, v_{i,i+1}, \ldots, v_{i,n}),$$

there exist $t_i$ and $\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,i-1}, \alpha_{i,i+1}, \ldots, \alpha_{i,n}$ such
that $0 \le \alpha_{i,j} \le \epsilon$ and $v_{i,j} = C_j(t_i) + \alpha_{i,j}$ for $j \ne i$. In
the following let $t$ be any real time when the execution
of the algorithm has terminated at every process.


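Lemma 2 For $1 \le i \le n$,

$$L_i(t) = \frac{1}{n} \sum_{1 \le k \le n} C_k(t) + \frac{1}{n(n-2)} \sum_{1 \le k \le n,\ k \ne i} \ \sum_{1 \le l \le n,\ l \ne k,i} (\alpha_{l,k} - \alpha_{l,i}).$$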


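Proof Since $C_k(t_l) - C_i(t_l) = C_k(t) - C_i(t)$, for $k \ne i$
we have

$$D_{k,i} = \frac{1}{n-2} \sum_{1 \le l \le n,\ l \ne k,i} (v_{l,k} - v_{l,i})$$

$$= \frac{1}{n-2} \sum_{1 \le l \le n,\ l \ne k,i} \bigl( (C_k(t_l) + \alpha_{l,k}) - (C_i(t_l) + \alpha_{l,i}) \bigr)$$

$$= C_k(t) - C_i(t) + \frac{1}{n-2} \sum_{1 \le l \le n,\ l \ne k,i} (\alpha_{l,k} - \alpha_{l,i}).$$

Thus

$$L_i(t) = C_i(t) + A_i(t)$$

$$= C_i(t) + \frac{1}{n} \sum_{1 \le k \le n} D_{k,i}$$

$$= C_i(t) + \frac{1}{n} \sum_{1 \le k \le n,\ k \ne i} (C_k(t) - C_i(t)) + \frac{1}{n(n-2)} \sum_{1 \le k \le n,\ k \ne i} \ \sum_{1 \le l \le n,\ l \ne k,i} (\alpha_{l,k} - \alpha_{l,i})$$

$$= \frac{1}{n} \sum_{1 \le k \le n} C_k(t) + \frac{1}{n(n-2)} \sum_{1 \le k \le n,\ k \ne i} \ \sum_{1 \le l \le n,\ l \ne k,i} (\alpha_{l,k} - \alpha_{l,i}).$$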


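Theorem 2 The algorithm synchronizes the logical
times of the $n$ processes to within $(1 + \frac{1}{n})\epsilon$. That is,
$|L_i(t) - L_j(t)| \le (1 + \frac{1}{n})\epsilon$ for any $i \ne j$.

Proof (Sketch) By Lemma 2,

$$L_i(t) - L_j(t) = \frac{1}{n(n-2)} \Bigl\{ \sum_{1 \le k \le n,\ k \ne i} \ \sum_{1 \le l \le n,\ l \ne k,i} (\alpha_{l,k} - \alpha_{l,i}) - \sum_{1 \le k \le n,\ k \ne j} \ \sum_{1 \le l \le n,\ l \ne k,j} (\alpha_{l,k} - \alpha_{l,j}) \Bigr\}$$

$$= \frac{1}{n(n-2)} (X - Y),$$

where

$$X = \sum_{1 \le k \le n,\ k \ne i,j} \alpha_{j,k} + (n-1) \sum_{1 \le l \le n,\ l \ne i,j} \alpha_{l,j} + (n-2)\,\alpha_{i,j}$$

and

$$Y = \sum_{1 \le k \le n,\ k \ne i,j} \alpha_{i,k} + (n-1) \sum_{1 \le l \le n,\ l \ne i,j} \alpha_{l,i} + (n-2)\,\alpha_{j,i}.$$

Since $0 \le \alpha_{l,k} \le \epsilon$ for $l \ne k$, we have

$$0 \le X, Y \le \{(n-2) + (n-1)(n-2) + (n-2)\}\epsilon = (n+1)(n-2)\epsilon.$$

Thus

$$|L_i(t) - L_j(t)| \le \frac{(n+1)(n-2)\epsilon}{n(n-2)} = \left(1 + \frac{1}{n}\right)\epsilon.$$

□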
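The regrouping of the two double sums into $X - Y$ is the least obvious step of the proof. The following sketch checks it numerically on random values of $\alpha_{l,k}$; the helper names `double_sum`, `x_term` and `max_identity_error` are hypothetical, and the check illustrates the identity rather than replacing the algebra. Note that $Y$ is $X$ with the roles of $i$ and $j$ exchanged.

```python
import random

def double_sum(alpha, i):
    """Sum over k != i and l != k, i of (alpha[l][k] - alpha[l][i])."""
    n = len(alpha)
    return sum(alpha[l][k] - alpha[l][i]
               for k in range(n) if k != i
               for l in range(n) if l != k and l != i)

def x_term(alpha, i, j):
    """X for the ordered pair (i, j); Y is x_term(alpha, j, i)."""
    n = len(alpha)
    others = [m for m in range(n) if m != i and m != j]
    return (sum(alpha[j][k] for k in others)
            + (n - 1) * sum(alpha[l][j] for l in others)
            + (n - 2) * alpha[i][j])

def max_identity_error(n, eps=1.0, seed=0):
    """Largest |(S_i - S_j) - (X - Y)| over all ordered pairs i != j."""
    rng = random.Random(seed)
    alpha = [[rng.uniform(0.0, eps) for _ in range(n)] for _ in range(n)]
    worst = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                lhs = double_sum(alpha, i) - double_sum(alpha, j)
                rhs = x_term(alpha, i, j) - x_term(alpha, j, i)
                worst = max(worst, abs(lhs - rhs))
    return worst
```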
5 Remarks 


Since the $(1 + \frac{1}{n(n-2)})\epsilon$ lower bound and the $(1 + \frac{1}{n})\epsilon$
upper bound obtained in this paper coincide with each
other if $n = 3$, the algorithm achieves optimal clock
synchronization for $n = 3$. Closing the small gap of
$\frac{n-3}{n(n-2)}\epsilon$ between the two bounds for $n \ge 4$ remains an
open problem.
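As a quick arithmetic check of the remark above, the two bounds and their gap can be evaluated exactly with rational numbers. The function names are illustrative only, and all values are expressed in units of $\epsilon$.

```python
from fractions import Fraction

def lower_bound(n):
    """Lower bound of Theorem 1, in units of epsilon."""
    return 1 + Fraction(1, n * (n - 2))

def upper_bound(n):
    """Upper bound of Theorem 2, in units of epsilon."""
    return 1 + Fraction(1, n)

def gap(n):
    """Difference between the upper and lower bounds, in units of epsilon."""
    return upper_bound(n) - lower_bound(n)
```

For $n = 3$ both bounds equal $\frac{4}{3}\epsilon$; for $n = 4$ the gap is $\frac{1}{8}\epsilon$, and it shrinks towards zero as $n$ grows.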
Abstract


We consider the channel routing problem of a
set of two-terminal nets in the knock-knee model.
The known strategy to handle this problem seems
to be inherently sequential. We develop a new
approach to route all the nets within d tracks,
where d is the density, such that the corresponding
layout can be realized with three layers. Both
the routing and the layer assignment algorithms
have linear time sequential implementations. In
addition, they both can be implemented on the
CREW-PRAM model in $O(\log n)$ time with $O(n)$
processors, where n is the number of nets. With
$1 \le p \le n^{1-\epsilon}$ processors, $\epsilon$ any positive constant,
the running time of the algorithms is $O(\frac{n}{p} + \log n)$.


1 Introduction 


Recent advances in VLSI technology allow the
fabrication of highly complex systems on single chips.
Sophisticated software tools are needed to successfully
design such systems. In particular, the routing phase is
a critical and time-consuming part of the overall design
process. Unfortunately, it turns out that most routing
problems are NP-complete and hence no efficient solutions
seem to be likely. There are a few exceptions, however.
For example, various river routing (one-layer) problems,
the two-layer channel routing with no constraints,
and a few routing problems in the knock-knee model are
known to have efficient solutions ([D et al],[MP],[O],[P],
[PL]). Our goal is to develop a good set of techniques to
obtain fast and efficient parallel routing algorithms.


In this paper, we consider the channel routing problem
of two-terminal nets in the knock-knee model. A
routing algorithm that uses d tracks, where d is the density,
is presented in ([PL]) such that the routing can be
realized with three layers. This algorithm can be viewed
as a nontrivial extension of the left edge algorithm ([O])
in which the routing is done row by row, left to right
according to a greedy strategy. However, this method
seems to be inherently sequential even for the case when
each column has at most one terminal. We develop a
novel strategy to obtain the optimal routing (which is
in general different from the one obtained by the [PL]
method) such that both the routing and the layer assignment
algorithms have linear time sequential implementations.
Moreover, they are both fully parallelizable in
the sense that they can be implemented on the CREW-PRAM
model in $O(\log n)$ time with $O(n)$ processors,
where n is the number of nets. If all the terminals lie
in the range [1, N], where N = O(n), then these algorithms
will run in $O(\frac{n}{p} + \log n)$ time with $p \le n^{1-\epsilon}$
processors, where $\epsilon$ is any positive constant.

1 Supported in part by NSA Contract No. MDA-904-85H-0015,


The rest of the paper is organized as follows. The ba- 
sic definitions needed for the rest of the paper are intro- 
duced in the next section, while in section 3 we develop a 
novel routing strategy and establish its correctness. The 
layer assignment algorithm is presented in the last sec- 
tion. 


2 Definitions 


We assume that the reader is familiar with the basic
definitions related to channel routing (see for example
[O],[PL]). In this paper, we restrict ourselves to two-terminal
nets N = <t,b>, where t is the top terminal
(on the top row) and b is the bottom terminal. t and b
will also represent the integer displacements of these terminals
relative to a fixed origin. N is a left (right) net
if t < b (t > b). Otherwise it is a vertical net. We will
also represent a net N as N = [l,r], where $l \le r$,
l = min{t,b} and r = max{t,b}. We refer to l and r
as the left and right terminals of N respectively. An instance
of the channel routing problem (CRP) is a channel
consisting of a rectangular grid and a set of nets whose
terminals lie on the grid points of the (horizontal) parallel
boundaries. The local density $d_x$ at x is defined to
be the number of nets $[l_i, r_i]$ such that $l_i \le x \le r_i$. The
density d is given by $d = \max_x \{d_x\}$. A routing in the
knock-knee model consists of a set of edge-disjoint paths
(made up of gridline segments) connecting the terminals
of each net. Hence a shared grid point could be one of
two types: crossing and knock-knee (Figure 1).
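The definitions of net type and density can be made concrete with a short sketch. The function names and the `(t, b)` tuple representation are assumptions made for this example. It also assumes that the local density counts nets whose closed interval contains the column x; the sample data are the terminal labels as read from Figure 3 (net k has its bottom terminal in column k and its top terminal in the column where k appears in the top row, with one illegible label taken to be 9).

```python
def net_type(t, b):
    """Classify a two-terminal net <t, b> as left, right or vertical."""
    if t < b:
        return "left"
    if t > b:
        return "right"
    return "vertical"

def density(nets):
    """Maximum over columns x of the number of nets [l, r] with l <= x <= r.

    Assumption: closed intervals evaluated at integer columns.
    """
    intervals = [(min(t, b), max(t, b)) for t, b in nets]
    first = min(l for l, _ in intervals)
    last = max(r for _, r in intervals)
    return max(sum(1 for l, r in intervals if l <= x <= r)
               for x in range(first, last + 1))

# Top-row labels as read from Figure 3 (the second label is assumed to be 9).
TOP_ROW = [4, 9, 1, 2, 3, 7, 6, 8, 13, 14, 5, 12, 11,
           10, 20, 17, 18, 16, 21, 25, 15, 24, 23, 19, 22]

# Net k = <t, b>: top terminal where k appears in TOP_ROW, bottom terminal in column k.
FIGURE3_NETS = [(TOP_ROW.index(k) + 1, k) for k in range(1, 26)]
```

With this reading of the figure, `density(FIGURE3_NETS)` evaluates to 5, which agrees with the five chains listed in Figure 4.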


Figure 1: Types of shared grid points


Let $L_1, L_2, \ldots, L_k$ be a set of conducting layers stacked
on top of each other such that $L_1$ is on the bottom and
$L_k$ is on the top. A wiring layout is an assignment of a
single layer to each routing segment such that (1) no two
segments of two distinct nets share a grid point on the
same layer, (2) a routing path may change layers at a via
and (3) no wire can use a grid point on a layer which
is between two layers with a via at that grid point. It
is known that any routing in the knock-knee model can
be realized with four layers ([BB]) and that three layers
suffice for the channel routing problem ([PL]).


Given a routing of an instance of CRP, the diagonal
diagram can be obtained by inserting a diagonal for each
knock-knee and a half-diagonal for each bend. If we remove
the half-diagonals, we obtain the core layout. It is known
that a wire layout can be realized with three layers if its
core can [PL]. A partition grid is a grid containing all the
diagonals (see [PL] for a formal definition). A set P of
edges of the partition grid is called a legal partition if the
following properties hold:


1. Every internal vertex is incident on an even number
of edges of P.


2. The set of diagonals in P is identical to that of the 
diagonal diagram. 


3. None of the forbidden patterns in Figure 2 appear 
in P. 


A legal partition of a core layout W exists if and only 
if W can be wired with three conducting layers. 


We use the standard CREW (Concurrent Read Exclusive
Write) shared memory model. All our results will
be stated in this model. However, our algorithms have
fast implementations on fixed-interconnection networks
such as the mesh or the hypercube. For example, all the
algorithms stated in this paper can be implemented on
a $\sqrt{n} \times \sqrt{n}$ mesh in time $O(\sqrt{n})$, where n is the input
length.


Figure 2: Forbidden Patterns


3 Channel Routing 


Given an instance of CRP of density d, our goal is to 
determine a wiring of all the nets in d tracks. In addition, 
the resulting layout or a slight modification of it should 
be realizable in three layers. 


The algorithm developed in [PL] constructs the wiring
track by track by laying out each track from left to right. The
overall strategy can be viewed as a nontrivial extension of
the line packing (or left edge) algorithm, where a mechanism
is provided to resolve conflicts arising in columns.
This approach seems to be inherently sequential even
if there is at most one terminal in each column. Our
method is quite different and consists of two main steps:


1. Partition the nets into d chains satisfying certain
properties to be outlined below. In particular, the
nets in each chain define a set of nonoverlapping
intervals.


2. Assign a track number to each chain. Then wire all 
the nets simultaneously. 


We will outline how to perform each step next. The
algorithm below creates chains of nets which will be modified
later to satisfy all the desired properties. We will
denote the successor (predecessor) of a net N by succ(N)
(pred(N)).
Algorithm Create Chains


Input: terminals $l_i$'s and $r_i$'s of all the nets $N_1, N_2, \ldots, N_n$.


Output: d chains of nets, where d is the density of the
corresponding channel routing problem.


1. Mark all terminals as active. For each left terminal $l_i$ of
a net $N_i$, find the nearest right terminal $r_j$ of some other
net such that $r_j$ is to the left (or in the same column) of
$l_i$. If two such choices are possible, pick the one whose
corresponding net is of the same type as $N_i$. Set $p(l_i) = r_j$.
If no such $r_j$ exists, then set $p(l_i)$ = nil. Similarly,
define $p(r_i)$ for each right terminal.

2. If $p(l_i) = r_j$ and $p(r_j) = l_i$, then set succ($N_j$) = $N_i$,
and mark $r_j$ and $l_i$ as inactive. Create a reference point
k between $r_j$ and $l_i$.

3. Let $R_1, R_2, \ldots, R_m$ be the intervals determined by the
reference points. For each $R_i$, create $L(R_i)$ consisting of
all the active left terminals, and $R(R_i)$ consisting of all
the active right terminals in $R_i$.

4. Find the corresponding terminal pairs in $R(R_i)$ and
$L(R_{i+1})$ and create links as before. Mark all terminals
used as inactive and merge intervals $R_{2i-1}$ and $R_{2i}$ for all
i. Repeat this step until there is one interval left.
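For comparison, a sequential stand-in can produce d chains of nonoverlapping intervals. The sketch below is not Algorithm Create Chains: it is the classical greedy left-edge interval partitioning, and it makes the simplifying assumption that two nets sharing a column go into different chains (step 1 above is more permissive, since it also accepts a right terminal in the same column). The function name `greedy_chains` and the `(l, r)` tuple representation are hypothetical.

```python
import heapq

def greedy_chains(intervals):
    """Partition closed intervals (l, r) into chains of pairwise disjoint intervals.

    Intervals are scanned by increasing left end; each one is appended to the
    chain whose last interval ends earliest, if that end lies strictly to the
    left of l, and otherwise it opens a new chain.
    """
    chains = []
    heap = []  # entries (right end of the chain's last interval, chain index)
    for l, r in sorted(intervals):
        if heap and heap[0][0] < l:
            _, idx = heapq.heappop(heap)   # reuse the chain that frees up first
        else:
            idx = len(chains)              # every chain still covers column l
            chains.append([])
        chains[idx].append((l, r))
        heapq.heappush(heap, (r, idx))
    return chains
```

Whenever a new chain is opened at column l, every existing chain ends with an interval that contains l, so the number of chains never exceeds the largest number of closed intervals sharing a column.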


As an example, consider the channel routing instance 
of Figure 3. The chains produced by the above algorithm 
are given in Figure 4. We also have the following. 


Top terminals:     4  9  1  2  3  7  6  8 13 14  5 12 11 10 20 17 18 16 21 25 15 24 23 19 22

Bottom terminals:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25


Figure 3: A channel routing problem


. Ny Nia > Nis — Nos 

. Ng — Nz > Ng — Mio — Nie — Nos 
. Ng > Ns — Ni2 — Mis — Nai — Naz 
N3 — Ne — Nis + Niz — Mio 

. Ng > Nii — Noo — N22 


ok; won 


Figure 4: The chains created by Algorithm Create Chains 


Lemma 1: The number of chains created by the above algorithm
is exactly d, where d is the channel density. This
algorithm can be implemented on the CREW-PRAM in
time $O(\log n)$ with $O(n)$ processors, where n is the number
of nets.


Proof: Let $R_1, R_2, \ldots, R_m$ be the intervals created by
the above algorithm, prior to a set of merging operations
of step 4, such that $K_i$ is the reference point between
$R_{i-1}$ and $R_i$. Let $n_{r_i}$, $n_{l_i}$ be respectively the numbers of
active right and left terminals in $R_i$ and let $n_{k_i}$ be the
number of nets with terminals on different sides of $K_i$.


Claim: The following inequalities hold true before each
set of merging operations performed in step 4 of the above
algorithm:

$$n_{r_i} + n_{k_{i+1}} \le d$$

$$n_{l_i} + n_{k_i} \le d$$


Proof of Claim: Notice that initially all active right terminals
in $R_i$ must be to the right of the rightmost left terminal
$l_j$ in $R_i$. If at the completion of step 3, $n_{r_i} + n_{k_{i+1}} > d$,
then the density of the channel at a point between the
right and left terminals of $R_i$ is $\ge n_{r_i} + n_{k_{i+1}} > d$, which is
impossible. Similarly we can establish the other inequality.
We now show that after each set of merging operations,
the inequalities will hold. Consider the merging of
the intervals R2;_; and R2;. We know that n,,,_, +n; < 
d and nj, + rg, < d. Let c = min{ni,_,, Mr. }. We 
distinguish between two cases: 


1. Suppose that n),, > n,,,_,. Then the number of left 
terminals in the new merged interval Rj, is given by 
Ny = Nig, + Mg; — Mrg@_1 — € and hence n;,, +k, = 
Nig; Fly, — Nrg;_y T Ukgi_1 — C. But Nigj-1 Tk = 
Tro; +Nk,; and therefore nj,,+N4,, = Ng;+Nkp;—C S 


2. Suppose that nj, < n,,,_,. Then the number of left 
terminals in the merged interval Ry will be n,, = 
Nip;-, ~C and thus nj,,+Ng,, = Nig, +k —€ XS A. 
In a similar fashion, we can establish the other inequality.
This concludes the proof of the claim. 


Let d' be the number of chains created by the above
algorithm. Clearly, $d' \ge d$. At the termination of the
algorithm, the number of chains is equal to the number
of left terminals. Using the claim above, we deduce that
$d' \le d$ and hence d' = d.


We now establish the time and processor bounds. One
can check that a couple of sorting steps and a few simple
operations will take care of steps 1-3. Step 4 consists of
$O(\log n)$ merging operations, each of which can be done
in O(1) time.


The above chains can be used to wire all the nets in 
d tracks. However, the corresponding layout may not be 
realizable in three layers. We modify the above chains 
so that they have the following property. Let c be any 
column. Then either 


1. c is empty, or

2. c contains one terminal, or

3. c contains two terminals of nets $N_i$ and $N_j$. Let
$N_i$ = <c, $b_i$> and $N_j$ = <$t_j$, c>.

• If both $N_i$ and $N_j$ are either right or left nets,
then they both belong to the same chain and
one is the successor of the other.

• Suppose that $N_i$ is a right net and $N_j$ is a left
net. The other case can be dealt with similarly.
Let $N_i'$ = succ($N_i$) and $N_j'$ = succ($N_j$).
Then they either share a column or the column
of $N_i'$ or $N_j'$ which is closer to c has only
one terminal (see Figure 5(b)).


Figure 5: Possible successors of two nets with right ter- 
minals in the same column 


The following algorithm outlines how to modify the 
chains so that the above property holds. 


Algorithm Modify Chains 


Input: A set of chains produced by Algorithm Create
Chains.
Output: A set of chains satisfying the property stated
above.


1. Mark each column with two right or two left terminals
as active.


2. For each active column c with a top right terminal $t_i$
and a bottom right terminal $b_j$, do the following:


• If the left terminals of succ($N_i$) and succ($N_j$) are
in the same column c', then mark both c and c' as
inactive.


• If the left terminals are in two distinct columns,
say c' containing the left terminal of succ($N_i$) is
the left one, then mark c inactive if c' has only one
terminal.


• Otherwise, c' contains another left terminal $b_x$. Let
$N_y$ = pred($N_x$). Then create the pair <$N_i$, $N_y$>.
Mark c and c' as inactive.


3. Group the pairs <$N_i$, $N_y$> into maximal groups
<$N_{x_0}$, $N_{x_1}$>, <$N_{x_1}$, $N_{x_2}$>, ..., <$N_{x_{t-1}}$, $N_{x_t}$>. Update
the successors of these nets by setting the new successor
of $N_{x_i}$ to be the previous successor of $N_{x_{i+1}}$ for all
$0 \le i \le t-1$. In addition, set the new successor of $N_{x_t}$ to be
the previous successor of $N_{x_0}$.

4. Repeat the procedure for active columns with two left
terminals.

5. Adjust chains in such a way that whenever the configurations
of Figure 5(a) occur, they will be replaced by
the corresponding configurations of Figure 5(b) (similarly
for columns with two left terminals).


As an example, consider the chains of Figure 4. Then 
the above algorithm creates the new set of chains given 
in Figure 6. 


- Ni + Ng — Nio + Nig > Nig 

. Ng — Nz + Ng — Nii — Nao > No 
- Nz > Ns > Ni2 > Mig > Noi + Now 
- N3 + Ma — Ms — Noe 

. No — M3 — Niz > Nos 


o me © NN pe 


Figure 6: New chains generated by Algorithm Modify 
Chains 


Lemma 2: The above algorithm modifies the chains generated
by Algorithm Create Chains such that the new
chains satisfy the desired properties. Moreover, the algorithm
runs in $O(\log n)$ time with $O(n)$ processors on the
CREW-PRAM model.


Proof: To simplify the presentation we will introduce a
new graph called the link graph. There is a vertex $v_c$ corresponding
to each column c. There is an edge between $v_c$
and $v_{c'}$ if and only if c contains a terminal of a net whose
successor or predecessor has a terminal in c'. Notice that
the link graph of each of the groups created in step 3 has
the form shown in Figure 7(a). If $c_j$ has another link to
a, then a cannot appear between $c_0$ and $c_j$. After the
modifications performed in step 3 the link graph of the
group will be of the form given in Figure 7(b) with 2
link loops or paths of length 2. Hence it is clear that
after step 3 no column with two right terminals could
cause any problem. Each group may have generated one
column with two left terminals which do not satisfy the
desired property. Step 4 of the above algorithm
takes care of all these columns (Figure 5). Step 5 ensures
that columns with two terminals will be of the form given
in Figure 5(b). The time and processor bounds of the algorithm
can be easily established.


Figure 7: Forms of groups in the proof of Lemma 2


The track assignment and the wire layout will be described
next. Suppose that track k has been assigned to
net N = <t,b>. Then the wire of N will consist of the
interval $[t_k, b_k]$ on track k, a vertical line segment from
b to $b_k$, and a vertical line segment from t plus a possible
detour to $t_k$. Therefore the problem comes down
to determining how to connect a terminal on the upper
row down vertically to its track. The algorithm below
describes how to achieve this.


Algorithm Wire Nets 


Input: A chain of nets as modified by Algorithm Modify
Chains.
Output: A wire layout for each net.


1. For each chain, assign the leftmost terminal $l_i$ as the
primary key, and, if $l_i$ is a bottom terminal, assign 0
as the secondary key and 1 otherwise. Sort the chains
according to their keys. The track number of each chain
is its corresponding rank.

2. For each column c, do the following:


1. If c contains one terminal of a net N, then connect
that terminal vertically to the track of N.


2. Suppose c contains two terminals of a single net.
Then connect these two terminals vertically.


3. Suppose that c contains two terminals of two distinct
nets N = <c,b> and M = <t,c>. If N and
M have the same track number, then wire the terminals
to this track using a knock-knee. Otherwise
there is a detour only if the track number of N is less
than that of M. In this case, it is a left or right
detour depending on whether c is a right or left terminal.
The detour extends to the column of either the
successor (for a right detour) or the predecessor (for
a left detour) of either N or M, whichever is closer.
All the cases that can arise and the corresponding
routing are shown in Figure 8.
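Step 1 of Algorithm Wire Nets amounts to a sort on a two-part key. A minimal sketch follows, under several labeled assumptions: each chain is a list of nets `(t, b)` ordered left to right, so its first net carries the leftmost terminal; a vertical first net is treated as having a bottom leftmost terminal; and ranks are numbered from 1. The function name `assign_tracks` is hypothetical.

```python
def assign_tracks(chains):
    """Return the track number (1-based rank) of each chain.

    chains[c] is a list of nets (t, b) ordered from left to right.
    """
    def key(chain):
        t, b = chain[0]
        leftmost = min(t, b)            # primary key: leftmost terminal column
        secondary = 0 if b <= t else 1  # 0 if that terminal is on the bottom row
        return (leftmost, secondary)

    order = sorted(range(len(chains)), key=lambda c: key(chains[c]))
    tracks = [0] * len(chains)
    for rank, c in enumerate(order, start=1):
        tracks[c] = rank
    return tracks
```

For example, three chains starting with the nets <5,3>, <1,4> and <3,1> have keys (3,0), (1,1) and (1,0), so they receive ranks 3, 2 and 1 respectively.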


Figure 8: Possible detours of nets with terminals in the
same column


Consider the example of Figure 3 again. Then the
routing obtained by the above algorithm is given in Figure 9.


Figure 9: (a) The layout generated by Algorithm Wire
Nets, (b) its corresponding diagonal diagram and (c) its
corresponding constraint graph

Lemma 3: Given an instance of the channel routing problem,
the above algorithm provides a legal routing of all
the nets in the knock-knee model.


Theorem 1: Given an instance of the channel routing
problem of density d, it is possible to wire all the nets in d
tracks in $O(\log n)$ time on the CREW-PRAM model
with $O(n)$ processors, where n is the number of nets. If
all terminals lie in the range [1, N], where N = O(n),
then the above algorithm can be implemented in O(n)
sequential time and in $O(\frac{n}{p} + \log n)$ parallel time with p
processors on the CREW-PRAM model, where $p \le n^{1-\epsilon}$
and $\epsilon$ is any positive constant.
Proof: The first statement of the theorem follows from
the previous lemmas. If all the terminals lie in the interval
[1, N], N = O(n), then sorting (the most expensive step)
takes O(n) sequential time. For the parallel implementation,
the most expensive steps are sorting and traversing
linked lists. Using the results of ([K et al]) we obtain the
bounds stated in the theorem.


4 Layer Assignment 


In this section, we show that a modified version of the
routing produced by the algorithm of the previous section
can be laid out in three layers. [PL] provides necessary
and sufficient conditions for the realization of a wiring
in three layers. As stated in section 2, the problem is essentially
reduced to finding a legal partition of the core
of the diagonal diagram. The routing layout produced
by the algorithm in [PL] has a special property, namely
every column is either empty or contains one diagonal or
a diagonal \ on the bottom and a diagonal / above it.
Their algorithm proceeds from left to right, looking at
each column and making vertical connections (and possibly
changing the routing) so that the resulting partition
is legal. Unfortunately, we encounter a major difficulty in
our case. Each column of our routing layout could have
two diagonals (\ and /) in an arbitrary order (because
our routing uses left and right detours). This makes it
necessary to change the wire layout much more substantially
than was done in [PL]. In the rest of this section,
we outline how to overcome this difficulty.


By adding dummy diagonals if necessary, we can assume
that each column is either empty or contains exactly
two diagonals. As in [PL], our partition will be
constructed by adding vertical edges only. Define a reference
line as a vertical line that touches the endpoint
of some diagonal. For each reference line, the diagonals
touching this line will partition it into several line segments.
Number these line segments starting from the topmost
segment. Notice that there are two possible ways
of adding vertical segments (to create a legal partition):
add the odd-numbered or the even-numbered segments.
We have to choose (if possible) those segments that will
not create a forbidden pattern.


We define the constraint graph as follows. The two
possible choices of vertical segments corresponding to reference
line $L_i$ are represented by two vertices $v_{2i-1}$ and
$v_{2i}$. Two vertices are connected by an edge if and only
if the corresponding choices create a forbidden pattern.
Notice that forbidden patterns can be created only between
adjacent reference lines.


Lemma 4: The total number of edges between the
vertices corresponding to adjacent reference lines is $\le 2$.


Proof: Since the maximum number of diagonals between
two adjacent vertical reference lines is 2, there are at most
two "constraints" between $\{v_{2i-1}, v_{2i}\}$ and $\{v_{2i+1}, v_{2i+2}\}$,
for each i.


Our goal is to pick for each reference line one of its
vertices such that no two such vertices are connected by
an edge. This may not be possible, in which case the
routing layout has to be modified. We introduce the patterns
that can create potential problems. A forbidden
column is a pair of vertices corresponding to a reference
line such that no selection of its vertices will lead to a
legal partition. The set of configurations that may give
rise to a forbidden column are shown in Figure 10.

Our goal is to modify the wiring layout if necessary
so that the resulting constraint graph has no forbidden
columns. We start by showing that any such graph will
lead to a legal partition. The following algorithm shows
how to select the proper set of vertices.


Figure 10: Configurations that may give rise to forbidden
columns


Algorithm Select 


Input: Reference lines and the corresponding constraint 
graph with no forbidden columns. 


Output: A subset of the vertices which will induce a legal 
partition of the wiring layout. 


1. Mark all reference lines as active. For each reference
line $L_i$, select $v_{2i}$ ($v_{2i-1}$) if $v_{2i-1}$ ($v_{2i}$) is incident on two
edges to a single adjacent column. If such a selection is
made, mark $L_i$ as inactive and assign weight 0 if $v_{2i}$ is
selected, otherwise assign weight 1.

2. Create a sorted list for each set of active reference
lines between two inactive reference lines.

3. For each list created in step 2, do the following. Assign
a weight 0 to each line $L_k$ in the list if there is an
edge between $v_{2k-3}$ and $v_{2k}$ or between $v_{2k-2}$ and $v_{2k-1}$.
Otherwise, assign a weight of 1 to $L_k$.


4. Calculate the rank of each reference line. Then select
$v_{2k}$ if the rank of $L_k$ is even; otherwise select $v_{2k-1}$.
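Steps 3 and 4 can be illustrated for a single list of consecutive reference lines. The text does not define "rank" in this excerpt, so the sketch assumes that the rank of a line is the running sum of the weights from the start of its list up to and including that line (the quantity a parallel prefix sum or list ranking would compute). The function name `select_vertices` and its parameters are hypothetical.

```python
from itertools import accumulate

def select_vertices(first_line, weights):
    """Pick one vertex index per reference line from the weights of step 3.

    first_line is the index k of the first reference line L_k in the list and
    weights[m] is the weight of L_{first_line + m}.  Rank is assumed to be the
    inclusive prefix sum of the weights.
    """
    ranks = list(accumulate(weights))
    selected = []
    for offset, rank in enumerate(ranks):
        k = first_line + offset
        # Even rank selects v_{2k}; odd rank selects v_{2k-1}.
        selected.append(2 * k if rank % 2 == 0 else 2 * k - 1)
    return selected
```

With weights [0, 1, 0, 1] for $L_1, \ldots, L_4$ the assumed ranks are 0, 1, 1, 2, giving the vertices $v_2, v_3, v_5, v_8$. A weight of 0 keeps the same parity as the previous line, and a weight of 1 flips it.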


Lemma 5: Given a partition graph with no forbidden
columns, Algorithm Select will generate a subset of the
vertices that determine a legal partition of the wiring
layout.


Proof: We start by observing that the selection made
in step 4 for inactive reference lines is consistent with
that of step 1 because the graph contains no forbidden
columns. For the rest of the proof, it is enough to show
that there is a selected vertex for each reference line such
that no two selected vertices are connected by an edge.
The algorithm clearly selects exactly one vertex for each
reference line. Suppose that there is an edge between two
selected vertices, say $v_{2k}$ and $v_{2k-2}$. Then the weight of
$L_k$ must be 0 (because both have even ranks). But then
either $v_{2k}$ is connected to $v_{2k-3}$ or $v_{2k-1}$ is connected to
$v_{2k-2}$. In the first case, $v_{2k-1}$ would have been selected;
in the second case, $v_{2k-3}$ would have been selected. Similarly
we can handle the other cases.


In the rest of this section, we will show how to modify 
the wiring in such a way that the corresponding con- 
straint graph has no forbidden columns. We first in- 
troduce the following classification of reference lines (cf 
[PL]): Trivial (Figure 11), Overlap (Figure 12), Disjoint 
(Figure 13), Inclusion (Figure 14). Each type is shown 
with its possible constraint graph. The only possible forbidden
columns could come from: D,, D3, Dg, Dg, Ie, Ls,
Ig, Ig. In most of these cases, the wiring has to be modi- 
fied by adding diagonals in such a way that no forbidden 
column could possibly arise. The procedure involves a 
detailed case study which is summarized by the follow- 
ing algorithm. 
Figure 11: Trivial reference lines
Figure 12: Overlap reference lines
Figure 14: Inclusion reference lines


Algorithm Modify 
Input: Wiring layout produced by Algorithm Wire Nets. 


Output: A new wiring with its modified constraint graph 
and a set of selected vertices. 


1. Generate the diagonal diagram, delete all half diago- 
nals and add necessary dummy diagonals as follows. If 
there exists exactly one diagonal \, then add a dummy 
diagonal / in an additional row above all the rows. If 
there exists exactly one diagonal /, then add a dummy 
diagonal \ in an additional row below all the rows. De- 
termine the constraint graph and mark all reference lines 
which may give rise to forbidden columns as active. 


2. Handle type $I_2$ active reference lines as follows. Let $L_i, L_{i-2}, \ldots, L_{i-2k}$ be a maximal chain of active $I_2$'s. We want to modify every other $L_j$, starting with $L_i$, in a way that depends on the type of its left neighbor $L_{j-1}$. All the cases that can arise are shown in Figure 15 with the corresponding modifications. In each such case, a vertex of $L_{i-1}$ is selected (its degree is 0), the edges between the reference line $L_{i-1}$ of the selected vertex and its neighbors are removed, and the reference lines $L_i$, $L_{i-1}$, $L_{i-2}$ are marked inactive. Handle type $I_6$ reference lines in a similar fashion.


3. Handle active type $I_4$ as shown in Figure 16. Select $v_{2i}$ and remove edges between $L_i$ and its neighbors. Mark $L_i$, $L_{i-1}$, $L_{i+1}$ as inactive. Handle type $I_8$ similarly.


4. Handle active type $D_1$ as shown in Figure 17. Select $v_{2i-1}$ and remove edges between $L_i$ and its neighbors. Mark $L_{i-1}$, $L_i$, $L_{i+1}$ as inactive. In Figure 18 a maximal chain of $D_1$'s is considered: $L_i, L_{i+1}, \ldots, L_k$ are all of type $D_1$. If $L_i$ or $L_k$ can give rise to a forbidden column, then modify as shown and remove all edges of $L_i$ through $L_k$. All the odd vertices of $L_i$ through $L_k$ are selected. As before, edges are removed for selected columns and adjacent reference lines are marked inactive. Repeat the same procedure for types $D_3$, $D_6$ and $D_8$.


Figure 15: Transformations on type $I_2$ reference lines.


Lemma 6: Algorithm Modify will change the wiring layout produced by Algorithm Wire Nets in such a way that the corresponding constraint graph contains no forbidden columns.


Figure 16: Transformations on type $I_4$ reference lines
Figure 17: Transformations on type $D_1$ reference lines


Proof: Consider the original constraint graph in which $L_i$ was of type $I_2$ (the hardest case). Then we have to show that $L_{i-3}$ will create no problems. The only nontrivial cases are the following:


1. $L_{i-3}$ is of type $I_2$. In this case the algorithm selects vertices in the columns corresponding to $L_{i-1}$ and $L_{i-4}$ and hence there are no edges left between $L_{i-2}$ and $L_{i-1}$, and between $L_{i-3}$ and $L_{i-4}$.


2. $L_{i-3}$ is of type $I_6$. Suppose that there are no dummy diagonals between $L_{i-3}$ and $L_{i-2}$ or between $L_{i-1}$ and $L_i$. The only possible wiring configurations are shown in Figure 19 with their corresponding diagonal diagrams. If there is a dummy diagonal between $L_{i-1}$ and $L_i$, then we can have one of the three possibilities shown in Figure 20. In each of these cases, one of $L_i$ or $L_{i-3}$ cannot generate a forbidden column.


3. $L_{i-3}$ is of type $I_4$, $I_8$, $D_1$, $D_3$, $D_6$ or $D_8$. One can check that none of these cases can possibly generate a forbidden column.


The remaining cases can be dealt with similarly.


If we go back to the example of Figure 2, then the 
routing produced by the algorithm of the previous section 
is given in Figure 9. The layer assignment algorithm will 
change the wiring of Nig and No (Figure 21) and the 
final layout is shown in Figure 22. 


Figure 18: Maximal chain of $D_1$'s.
Figure 19: Possible wiring configurations for case 2 of Lemma 6
Figure 20: Possible configurations with dummy diagonals between $L_{i-1}$ and $L_i$.

Figure 21: Changes in the wiring of Nig and No 


Theorem 2: Given an instance of the channel routing problem, it is possible to determine a three-layer assignment of the routing layout in $O(\log n)$ time with $O(n)$ processors on the CREW-PRAM model. If all terminals lie in the range $[1, N]$, where $N = O(n)$, then the above algorithm can be implemented in $O(n)$ sequential time and in $O(n/p + \log n)$ parallel time with $p$ processors on the CREW-PRAM model, where $p \le n^{1-\epsilon}$, and $\epsilon$ is any positive constant.
Figure 22: (a) The final layout after the modification of the layer assignment algorithm, (b) its corresponding diagonal diagram and (c) its corresponding constraint graph


PARALLEL ALGORITHM FOR MINIMUM DUAL-COVER
WITH APPLICATION TO CMOS LAYOUT


Y. M. Huang and M. Sarrafzadeh
Department of Electrical Engineering and Computer Science
The Technological Institute
Northwestern University
Evanston, IL 60208


Abstract: In a pair of planar graphs $(G, G^d)$, with $G^d$ being the dual graph of $G$, a sequence of distinct edges is a dual-Euler trail if it is a trail both in $G$ and in $G^d$. A set of disjoint dual-Euler trails that simultaneously cover $G$ and $G^d$ is called a dual-cover. We present an $O(\log n)$ time and $O(n)$ processor algorithm, in the PRAM model, based on graph separator theory, for obtaining a minimum cardinality dual-cover in a pair of series-parallel graphs $(G, G^d)$, where $n$ is the total number of edges. We employ the proposed algorithm to obtain a minimum-area VLSI layout of CMOS functional cells.


1 Introduction 


Algorithm design is the development of better proce- 
dures and data structures to reduce the time to solve a 
given problem on a given computing system. Exploitation 
of a multiprocessor system requires a radical departure from
the traditional Von Neumann environment. Detection of 
parallelism in sequential programs is essential to the disci- 
pline. 

In the parallel random-access machine (PRAM) model there is a group of processors, with access to a shared memory, cooperating to solve a given problem. An effective algorithm in the PRAM model should aim to minimize the computation time and the number of processors.

Consider a planar graph $G = (V, E)$ along with its dual graph $G^d = (V^d, E^d)$, where there is a one-to-one correspondence between $E$ and $E^d$, as shown in Figures 1a and 1b. A trail in $G$ is a sequence of vertices $\tau = (v_a, v_{a+1}, \ldots, v_{b+1})$, where $e_i = (v_i, v_{i+1}) \in E$, $v_i \ne v_{i+1}$, and $e_i \ne e_j$ for $a \le i, j \le b$ with $i \ne j$. To each trail $\tau$ we associate a label $L(\tau) = (e_a, e_{a+1}, \ldots, e_b)$. Consider a trail $\tau$ of $G$ and a trail $\tau^d$ of $G^d$. A pair $t = (\tau, \tau^d)$, with $\tau$ being a trail in $G$ and $\tau^d$ being a trail in $G^d$, is called a dual-Euler trail (DET) if $L(\tau) = L(\tau^d)$. A set of disjoint DETs $\{t_1, \ldots, t_s\}$ is called a dual-cover if $L(t_i) \cap L(t_j) = \emptyset$, for $i \ne j$, and $\bigcup_{i=1}^{s} L(t_i) = E$. An optimal dual-cover of $(G, G^d)$ is a minimum cardinality dual-cover, that is, a dual-cover with minimum $s$.
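
The definitions above can be checked mechanically on small instances. The following is a minimal sketch, not part of the proposed algorithm: it assumes each graph is given as a dictionary from edge label to its two endpoints, and the names `is_trail` and `is_dual_cover` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]  # the two endpoints of an undirected edge


def is_trail(labels: Sequence[str], ends: Dict[str, Edge]) -> bool:
    """True if the edges, taken in this order, can be walked as a trail."""
    if not labels or len(set(labels)) != len(labels):
        return False  # empty sequence, or an edge is used twice
    u, v = ends[labels[0]]
    if u == v:
        return False  # v_i != v_{i+1}: self-loops are excluded
    reachable = {u, v}  # vertices we may stand on after the first edge
    for lab in labels[1:]:
        x, y = ends[lab]
        if x == y:
            return False
        nxt = set()
        if x in reachable:
            nxt.add(y)
        if y in reachable:
            nxt.add(x)
        if not nxt:
            return False  # the edge does not continue the walk
        reachable = nxt
    return True


def is_dual_cover(dets, g: Dict[str, Edge], gd: Dict[str, Edge]) -> bool:
    """Each label sequence must be a trail in G and in G^d (a DET);
    the DETs must be pairwise disjoint and together cover E."""
    seen = []
    for labels in dets:
        if not (is_trail(labels, g) and is_trail(labels, gd)):
            return False
        seen.extend(labels)
    return len(seen) == len(set(seen)) and set(seen) == set(g)


# Assumed toy instance: two edges in series in G, hence in parallel in G^d.
g = {"a": ("N", "m"), "b": ("m", "S")}
gd = {"a": ("Nd", "Sd"), "b": ("Nd", "Sd")}
single = is_dual_cover([["a", "b"]], g, gd)   # one DET covers everything
split = is_dual_cover([["a"], ["b"]], g, gd)  # valid, but not minimum
partial = is_dual_cover([["a"]], g, gd)       # does not cover E
```

In this toy instance both `[["a", "b"]]` and `[["a"], ["b"]]` are dual-covers, but only the first is optimal since it has the smaller cardinality $s$.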

A CMOS functional cell consists of two parts: the p-part 
representing PMOS transistors, and the n-part representing 
NMOS transistors. Each transistor has a polysilicon strip; 
one side of the polysilicon strip being a source and the other 
side being a drain. The p-part is a series-parallel inter- 
connection of PMOS transistors; similarly, the n-part is a 
series-parallel interconnection of NMOS transistors, and is 
the dual of the p-part. Representing the p-part and n-part interconnections by $G_p = (V_p, E_p)$ and $G_n = (V_n, E_n)$, respectively, we have $G_p = G_n^d$ and $G_n = G_p^d$. In CMOS circuits, it is possible to implement complex logic functions supported by complementary NMOS and PMOS transistors instead of by conventional NAND and NOR logic elements. The former implementation requires about half the area of the latter implementation, and has less time delay and better performance.

Figure 1: (a) NMOS graph. (b) PMOS graph. (c) Binary decomposition tree of (a)/(b).

A systematic approach to layouts of CMOS functional 
cells has been proposed by Uehara and VanCleemput [UV]; 
we will refer to it as UV style. A UV layout can be viewed 
as a set of vertical polysilicon lines corresponding to gates, 
and a set of horizontal metal lines, corresponding to inter- 
connections among the transistors. A source or a drain of 
a transistor is connected to a source or a drain of another 
transistor either by horizontal metal lines or by adjoining 
their corresponding gates (their polysilicon vertical lines). 
The former requires metal connections; thus, it increases 
the height of the layout area. The latter does not require 
any connection. 


Consider a UV layout. Let a polysilicon pitch be the 
minimum separation between two polysilicon lines and a 
diffusion pitch be the minimum separation between two dif- 
fusion regions. Two polysilicon strips with common source 
or drain have a polysilicon pitch separation; otherwise they 
have a polysilicon plus diffusion pitch separation. An opti- 
mal UV layout is obtained when the transistors are “chained” 
(i.e., placed adjacent to each other) in an “optimal” manner. It has been shown [UV] that an optimal UV layout corresponds to an optimal dual-cover of $(G_p, G_n)$. A heuristic algorithm for obtaining a dual-cover of $(G_p, G_n)$ has been proposed in [UV]. Subsequently, two optimal algorithms running in $O(|E_p|)$ time in the RAM model were proposed [NBR, MH]. If $(G_p, G_n)$ does not have a single dual-cover, then the algorithm of [NBR] cannot produce a layout [WPF].

In this paper, we will show an $O(\log |E_p|)$ time and $O(|E_p|)$ processor algorithm, in the PRAM model, for obtaining an optimal dual-cover of $(G_p, G_n)$. As a subproblem, we will show how to separate a series-parallel graph $G = (V, E)$ using $O(1)$ time and $O(|E|)$ processors, an improvement over the previous $O(\log^2 |E|)$ time and $O(|E|^{1+\epsilon})$ processor result, $\epsilon > 0$ [GM] (the algorithm of [GM] works on arbitrary planar graphs). The proposed algorithm is based on the divide-and-conquer principle. The aim is to recursively partition $(G_p, G_n)$ into two “equal-size” subgraphs using a dual-graph separation theory. Then the processors collectively obtain an optimal dual-cover in each subgraph and combine them to produce an optimal dual-cover of $(G_p, G_n)$. The technique we use in the combination step is an extension of the algebra proposed in [MH].

This paper is organized as follows. In Section 2 prelimi- 
nary definitions and results are given. The proposed parallel 
algorithm, for obtaining an optimal dual-cover, is presented 
in Section 3. An application of the proposed parallel algo- 
rithm to optimal UV-style layout of CMOS functional cells 
is described in Section 4 and experimental results are in- 
cluded. Details of the proposed implementation are given 
in Appendix A. 


2 Preliminaries 


A series-parallel graph (SP graph) is constructed by recursively applying “series” and “parallel” connections. SP graphs form a subclass of planar graphs. We will introduce an effective method for finding all dual-covers of a pair of SP graphs with a fixed topology (non-permutable topology).

Figure 2: (a) A series connection. (b) A parallel connection.


2.1 Abstract Model 


A Boolean logic function is modeled as a series-parallel graph $G = (V, E)$ with $E$ corresponding to the input signals and $V$ corresponding to the AND/OR operators. In each graph $G$, there are two distinguished terminal vertices labeled as N (the northern terminal) and S (the southern terminal).


Definition 1: Two subgraphs $G_1$ and $G_2$ have a series connection if they have one common vertex, and have a parallel connection if they have two common vertices (see Figure 2).


Recursive combinations of a SP graph are described by a binary decomposition tree (BDT) $T$. Consider a SP graph $G = (V, E)$ and a BDT $T = (V_T, E_T)$. Each leaf of $T$ corresponds to an edge of $G$ and each internal vertex of $T$ corresponds to a combination of two subgraphs $G_1$ and $G_2$, either in series (noted as *) or in parallel (noted as +). Let $T_1$ and $T_2$ be two BDTs corresponding to $G_1$ and $G_2$, respectively. The BDT $T$ corresponding to SP graph $G = G_1 \# G_2$ has a vertex labeled # with $T_1$ and $T_2$ as its left subtree and right subtree, respectively (see Figure 1), where # is used as a generic symbol for (+, *).
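
To make the correspondence between a BDT and its SP graph concrete, the sketch below expands a BDT into an edge list. It is only an illustration under stated assumptions: the nested-tuple encoding `(op, left, right)`, the vertex numbering (0 for N, 1 for S), and the helper names `build_sp_graph` and `bdt_size` are all hypothetical and not taken from the paper.

```python
from itertools import count


def build_sp_graph(bdt):
    """Expand a BDT into (edges, N, S); each edge is (label, u, v).

    A leaf is an edge label (str); an internal vertex is a tuple
    (op, left, right) with op '*' (series) or '+' (parallel).
    """
    fresh = count(2)  # vertex ids 0 and 1 are reserved for N and S
    edges = []

    def expand(node, north, south):
        if isinstance(node, str):  # leaf of T: one edge of G
            edges.append((node, north, south))
            return
        op, left, right = node
        if op == "*":  # series: the two subgraphs share one vertex
            mid = next(fresh)
            expand(left, north, mid)
            expand(right, mid, south)
        elif op == "+":  # parallel: the two subgraphs share both terminals
            expand(left, north, south)
            expand(right, north, south)
        else:
            raise ValueError(f"unknown operator {op!r}")

    expand(bdt, 0, 1)
    return edges, 0, 1


def bdt_size(node):
    """Number of vertices of the BDT (leaves plus internal vertices)."""
    if isinstance(node, str):
        return 1
    return 1 + bdt_size(node[1]) + bdt_size(node[2])


tree = ("*", "a", ("+", "b", "c"))  # a in series with (b parallel c)
edges, north, south = build_sp_graph(tree)
```

For a full binary tree with $|E|$ leaves, `bdt_size` returns $2|E| - 1$, which is the vertex count used for the BDT later in the paper.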


Consider a $2 \times 2$ terminal-matrix $\begin{bmatrix} N & S \\ N^d & S^d \end{bmatrix}$ corresponding to $(G, G^d)$, where $N$ and $S$ are the two distinguished vertices of $G$, and $N^d$ and $S^d$ are the two distinguished vertices of $G^d$. Let two SP graphs $G_1$ and $G_2$ have terminal-matrices $\begin{bmatrix} N_1 & S_1 \\ N_1^d & S_1^d \end{bmatrix}$ and $\begin{bmatrix} N_2 & S_2 \\ N_2^d & S_2^d \end{bmatrix}$, respectively. A SP graph $G = G_1 * G_2$ has a terminal-matrix determined by the terminal at which $G_1$ and $G_2$ are joined; for instance, it is $\begin{bmatrix} N_1 & S_2 \\ N_1^d & S_1^d \end{bmatrix}$ if $S_1 = N_2$. Since the dual SP graphs $G_1^d$ and $G_2^d$ are connected in parallel when $G_1$ and $G_2$ are connected in series, $N_1^d = N_2^d$ and $S_1^d = S_2^d$. Similarly, a SP graph $G = G_1 + G_2$ has a terminal-matrix determined by the dual terminal at which $G_1^d$ and $G_2^d$ are joined; for instance, it is $\begin{bmatrix} N_1 & S_1 \\ N_1^d & S_2^d \end{bmatrix}$ if $S_1^d = N_2^d$. Since the dual graphs $G_1^d$ and $G_2^d$ have a series connection when $G_1$ and $G_2$ have a parallel connection, $N_1 = N_2$ and $S_1 = S_2$.


2.2 Dual-Euler Trail


Consider a pair of graphs $(G, G^d)$ and a dual-Euler trail $t$ with $L(t) = (e_a, e_{a+1}, \ldots, e_b)$. We call the starting and terminating vertices of a trail in $G$ boundary vertices; similarly, we call the starting and terminating vertices of a trail in $G^d$ dual-boundary vertices (or, for short, d-boundary vertices). Note that a DET $t$ with $L(t) = (e_a, e_{a+1}, \ldots, e_b)$ and its “reverse” $t^r$ with $L(t^r) = (e_b, \ldots, e_{a+1}, e_a)$ are equivalent. The boundary vertices $v_a$ and $v_{b+1}$, and the d-boundary vertices $v_a^d$ and $v_{b+1}^d$ are used as the subscript of a DET label: $L(t) = (e_a, e_{a+1}, \ldots, e_b)_{(v_a, v_a^d) \to (v_{b+1}, v_{b+1}^d)}$.

Following [MH] we say $(v_i, v_i^d)$ is a terminal pair if $v_i$ is a boundary vertex of a DET $t$, $v_i^d$ is a d-boundary vertex of the same DET $t$, and both $v_i$ and $v_i^d$ are distinguished terminal vertices of a pair of graphs $(G, G^d)$. A DET is distinguished if it has at least one terminal pair and two DETs are incompatible if they cannot be “joined” with each other.

Each boundary vertex of a DET has type N, S, or I if it is the northern, the southern, or an internal vertex of the corresponding SP graph, respectively. A DET $t$ has type $(T_1, T_1^d)/(T_2, T_2^d)$, where $T_1$ and $T_2$ are the types of the boundary vertices, and $T_1^d$ and $T_2^d$ are the types of the d-boundary vertices. A boundary and d-boundary vertices pair $(v_i, v_i^d)$ can be of type (N,N), (N,S), (S,N), (S,S), or (I,I) ((N,I), (S,I), (I,N), and (I,S) are included in (I,I)). Therefore, a DET has 25 possible types. Eliminating equivalent DET types (for example, (N,S)/(S,S) is equivalent to (S,S)/(N,S)) and the four impossible DET types (N,N)/(N,N), (N,S)/(N,S), (S,N)/(S,N), and (S,S)/(S,S) yields 11 possible types. Let Z denote incompatible DET types. The set of DET types is:
$\Gamma$ = { (N,N)/(S,S), (N,S)/(S,N), (S,N)/(S,S),
(N,N)/(S,N), (N,N)/(N,S), (N,S)/(S,S),
(S,N)/(I,I), (S,S)/(I,I), (N,N)/(I,I),
(N,S)/(I,I), (I,I)/(I,I), Z }
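
The count of 11 possible DET types can be reproduced by a short enumeration. This is only a sanity check of the arithmetic in the paragraph above; the two-letter strings such as "NN" are an assumed shorthand for the pair types (N,N), and the variable names are illustrative.

```python
from itertools import combinations_with_replacement

# Assumed shorthand: "NS" stands for the pair type (N,S), "II" for (I,I).
PAIR_TYPES = ["NN", "NS", "SN", "SS", "II"]

# All ordered combinations of two pair types: 5 * 5 = 25.
ordered = [(p, q) for p in PAIR_TYPES for q in PAIR_TYPES]

# A DET and its reverse are equivalent, so order does not matter.
unordered = list(combinations_with_replacement(PAIR_TYPES, 2))

# The same terminal pair cannot occur at both ends of one DET.
possible = [(p, q) for p, q in unordered if not (p == q and p != "II")]

# Distinguished DET types have at least one terminal pair.
distinguished = [t for t in possible if t != ("II", "II")]
```

Here `possible` contains the 11 types listed in the set of DET types (without the incompatibility marker Z), and `distinguished` contains every type except the internal one, (I,I)/(I,I).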


Theorem 1 [MH]: The triplet $(\Gamma, +, *)$ forms an algebra.


Example: Consider Figure 1. A DET $t_1$ with $L(t_1) = (a,b)_{(1,6) \to (2,6)}$ has type (N,N)/(S,N) according to its terminal-matrix. Another DET $t_2$ with $L(t_2) = (c,d,f,e)_{(4,7) \to (4,7)}$ has type (I,I)/(I,I), because, according to its terminal-matrix, neither of the boundary and d-boundary vertices pairs is a terminal pair.


An (I,I)/(I,I) trail is called an internal DET. Note that an incompatible DET is not necessarily an internal DET, because two distinguished DETs cannot join together without compatible boundary and d-boundary vertices pairs.


Theorem 2: There are at most four distinguished DETs in a dual-cover.

Proof: We recall the definition of a dual-cover. All the DETs in a dual-cover are disjoint incompatible DETs. Therefore, any two distinguished DETs $t$ and $t'$ in a dual-cover have types $(T_1, T_1^d)/(T_2, T_2^d) \ne (T'_1, T'^d_1)/(T'_2, T'^d_2)$. Two distinguished DETs $t$ and $t'$ become compatible when $(T_1, T_1^d)/(T_2, T_2^d) = (T'_1, T'^d_1)/(T'_2, T'^d_2)$, since the types $T_1$, $T_1^d$, $T_2$, and $T_2^d$ of a DET are constructed according to the same terminal-matrix. A graph can only have four distinct types of terminal pairs: (N,N), (N,S), (S,N), and (S,S). Note that (N,I) and (S,I) are not legal types of terminal pairs. These construct four distinct types of a maximum cardinality set of incompatible distinguished DETs, (N,N)/(I,I), (N,S)/(I,I), (S,N)/(I,I), and (S,S)/(I,I), in a dual-cover. Any other distinguished DET in the same dual-cover is compatible with two of those DETs (e.g., a DET with type (N,N)/(S,N) is compatible with the DET (N,N)/(I,I) and with the DET (S,N)/(I,I)), and this contradicts the definition of a dual-cover. We conclude that there are at most four distinguished DETs in a dual-cover. □


Let a concatenation step be the process of concatenating two dual-covers, that is, $t_1$ concatenates with $t_2$ if $L(t_1) \cap L(t_2) = \emptyset$ and $t_1$ and $t_2$ have a common vertex in $G_1$ and $G_2$ (or $G_1^d$ and $G_2^d$). In the resulting DET $t$, $L(t) = L(t_1) \cup L(t_2)$.


Lemma 1: An internal DET is not able to concatenate with any other DET.

Proof: Consider an internal DET $t_1$ of a pair of SP graphs and a distinguished DET $t_2$ of another pair of SP graphs. The two pairs of graphs are joined at terminal vertices. Therefore, a DET must be distinguished and have compatible terminal vertices with the other DET for concatenation. However, $t_1$ has neither distinguished vertices nor compatible terminal vertices with $t_2$ in the combination step. Therefore an internal DET is unable to concatenate with the other distinguished DET. □


An example showing incompatible DETs is depicted in Figure 3.

Figure 3: (a) An internal DET of $(G_2, G_2^d)$ and a distinguished DET of $(G_1, G_1^d)$. (b) Incompatible DETs.


Let Match($t_i$, $t_j$) = 1 if DET $t_i$ is compatible with DET $t_j$; otherwise Match($t_i$, $t_j$) = 0. We define a trail-match to be the process of matching two distinguished DETs. According to Lemma 1 and Theorem 2, a concatenation step can be done in at most 16 trail-matches.


Let a dual-cover type $\delta$ represent a set of distinguished DET types in a dual-cover. Each series-parallel operator # constitutes a pair of semigroup algebras. Let $A_0$, $A_1$, $A_2$, $A_3$, and $A_4$ represent the five styles (consisting of 0, 1, 2, 3, 4 distinguished DETs) of dual-cover types. That is:


$A_0$ = { (I,I)/(I,I) }


$A_1$ = { (N,N)/(S,S), (N,S)/(S,N), (S,N)/(S,S), (S,S)/(I,I),
(N,N)/(N,S), (N,S)/(S,S), (S,N)/(I,I), (N,N)/(I,I),
(N,N)/(S,N), (N,S)/(I,I) }


$A_2$ = { [(N,N)/(S,N), (N,S)/(S,S)], [(N,S)/(S,N), (N,N)/(S,S)],
[(N,N)/(I,I), (S,S)/(I,I)], [(N,N)/(I,I), (N,S)/(I,I)],
[(N,N)/(I,I), (S,N)/(I,I)], [(N,N)/(I,I), (N,S)/(S,S)],
[(N,N)/(I,I), (S,N)/(S,S)], [(N,N)/(I,I), (S,N)/(N,S)],
[(S,S)/(I,I), (S,N)/(I,I)], [(S,S)/(I,I), (N,S)/(I,I)],
[(S,S)/(I,I), (N,N)/(S,N)], [(S,S)/(I,I), (N,N)/(N,S)],
[(N,S)/(I,I), (N,N)/(S,N)], [(N,S)/(I,I), (S,N)/(I,I)],
[(N,S)/(I,I), (S,N)/(S,S)], [(N,S)/(I,I), (N,N)/(S,S)],
[(S,N)/(I,I), (N,S)/(S,S)], [(S,N)/(I,I), (N,N)/(S,S)],
[(S,N)/(I,I), (N,N)/(N,S)], [(N,N)/(N,S), (S,N)/(S,S)] }


$A_3$ = { [(N,N)/(I,I), (N,S)/(I,I), (S,S)/(I,I)],
[(N,N)/(I,I), (N,S)/(I,I), (S,N)/(S,S)],
[(N,N)/(I,I), (S,N)/(I,I), (S,S)/(I,I)],
[(N,N)/(I,I), (S,N)/(I,I), (N,S)/(S,S)],
[(N,S)/(I,I), (S,S)/(I,I), (N,N)/(S,N)],
[(N,S)/(I,I), (S,S)/(I,I), (S,N)/(I,I)],
[(S,S)/(I,I), (S,N)/(I,I), (N,N)/(N,S)],
[(N,N)/(S,S), (N,S)/(I,I), (S,N)/(I,I)],
[(N,N)/(I,I), (S,S)/(I,I), (N,S)/(S,N)] }


$A_4$ = { [(N,N)/(I,I), (N,S)/(I,I), (S,N)/(I,I), (S,S)/(I,I)] }


Note that internal-type DETs are not involved in $A_1$, $A_2$, $A_3$, and $A_4$. The type (I,I)/(I,I) in $A_0$ is a single-trail dual-cover.


Theorem 3 [MH]: There are exactly 42 dual-cover types in a series-parallel combination.


Consider a set of SP graphs $(G, G^d)$. Let a dual-cover set $D$ be an optimal set of dual-covers with minimum cardinality. $D$ is obtained by series or parallel combinations of two dual-cover sets $D_1$ and $D_2$ (i.e., $D = D_1 \# D_2$).


Lemma 2: No two dual-covers in a dual-cover set have the same dual-cover type except in $A_0$.

Proof: Each dual-cover set is an optimal set. Consider a set of dual-covers $D$. As we mentioned in Lemma 1, the dual-cover $D(i)$ with $\delta(i) \in A_0$ is unable to combine with $D(j)$, where $j \ne i$. If two dual-covers $D(j)$ and $D(k)$ have $\delta(j) = \delta(k)$ with $\delta(j)$ and $\delta(k) \in A_i$, where $i \in \{1,2,3,4\}$, they will have the same concatenations in the next combination step. This contradicts the definition of a dual-cover set. □


Lemma 3: Each dual-cover has the smallest possible number of internal DETs.

Proof: Consider a dual-cover of a SP graph. Since it is a set of optimal disjoint DETs, then, except for the distinguished DETs, all the internal DETs in the internal DET set must form the smallest possible set and be disjoint from each other. □


Lemmas 2 and 3, and Theorem 3 establish the following 
conclusion: Every dual-cover set obtained by a combination 
step of two dual-cover sets contains at most 42 different 
dual-covers. 

We call the combinations of dual-covers $D_1(i)$ and $D_2(j)$ from two dual-cover sets $D_1$ and $D_2$ a trailhunt step, where $D_1(i) \in D_1$, $D_2(j) \in D_2$, $1 \le i \le |D_1|$, and $1 \le j \le |D_2|$. There are at most $42 \times 42 \times 16 = 28224$ trail-matches for a trailhunt. In fact, there is only one dual-cover with $A_4$ type in a dual-cover set. Moreover, usually far fewer than 42 dual-covers are included in an optimal set of dual-covers. Therefore, far fewer than 28224 trail-matches need to be performed in a trailhunt.


2.3 Graph Separator Theory


Consider a BDT $T = (V_T, E_T)$. Let a cut-edge $e_c$ be an edge separating $T$ into two “equal-size” sub-BDTs. There exists a cut-edge in every BDT [LT]. The edge $e_c$ partitions $T$ into $T_1 = (V_{T_1}, E_{T_1})$ and $T_2 = (V_{T_2}, E_{T_2})$, where $E_T = E_{T_1} \cup E_{T_2} \cup \{e_c\}$ and $\frac{1}{3}|V_T| \le |V_{T_1}|, |V_{T_2}| \le \frac{2}{3}|V_T|$.

Every vertex $v_i$ in a BDT $T = (V_T, E_T)$, where $1 \le i \le |V_T|$, is the root of a (possibly empty) sub-BDT $T_i$. Consider the cut-edge $e_c = (v_c, v_p)$. We call $v_c$ a cut-vertex if $v_p$ is the parent of $v_c$. Two sub-BDTs $T_1$ and $T_2$ are obtained from $T$ by removing the cut-edge. After the separation, the roots of $T_1$ and $T_2$ are $v_c$ and the root of $T$ (e.g., Figure 4a), or $v_c$ and the other child of $v_p$ (e.g., Figure 4b).

Figure 4: Two kinds of tree separation. Legend: root vertex, cut vertex $v_c$, parent vertex of $v_c$.

When $T$ is separated into two “equal-size” sub-BDTs $T_1$ and $T_2$, the corresponding graph $G$ is separated into two “equal-size” SP subgraphs $G_1$ and $G_2$ with $T_1$ being the BDT of $G_1$ and $T_2$ being the BDT of $G_2$. Subgraphs $G_1$ and $G_2$ have new terminal-matrices $\begin{bmatrix} N_1 & S_1 \\ N_1^d & S_1^d \end{bmatrix}$ and $\begin{bmatrix} N_2 & S_2 \\ N_2^d & S_2^d \end{bmatrix}$, respectively. Consider a SP graph $G = G_1 \# G_2$ with the terminal-matrix $\begin{bmatrix} N & S \\ N^d & S^d \end{bmatrix}$. If $G_1$ and $G_2$ have two common vertices, then # = ‘*’, and $N_1 = N_2$ and $S_1 = S_2$, which are not necessarily $N$ or $S$ of $G$ (see Figure 5a). If $G_1$ and $G_2$ have one vertex in common, then # = ‘+’, $N_1 = N$, $S_1 = N_2$, and $S_2 = S$ (see Figure 5b). The same rules apply to $G^d$, $G_1^d$, $G_2^d$. For efficient implementation of trailhunt, the terminal-matrices of $(G_1, G_1^d)$ and $(G_2, G_2^d)$ have to be stored in order to decide the types of DETs.

Consider a SP graph $C = (V_C, E_C)$. Let $C = A \uplus B$ be the union of two SP graphs $A = (V_A, E_A)$ and $B = (V_B, E_B)$, where $V_C = V_A \cup V_B$ and $E_C = E_A \cup E_B$.


Example: Consider Figure 5a with $G = A \uplus B \uplus C$, where $\uplus$ denotes a composition of two graphs. When $G$ is separated into $G_1 = B$ and $G_2 = A \uplus C$, we observe that the new boundary vertices of $G_1$ and $G_2$ are the same as the vertices being split by the separation line. Therefore, the terminal-matrices of $G_1$ and $G_2$ derived from $G$


are described as follows : | ne | —> | aie | & 
7 G Gy 
2 3 


4 6 4 5 
| 4 5 | . Again, consider Figuer 5b. As before, G = A 
G 


$\uplus B \uplus C$. When $G$ is separated into $G_1 = A$ and $G_2 = B \uplus C$, the separation line cuts $G_1$ and $G_2$ at vertex 2 of $G$. Hence, the terminal-matrices of $G_1$ and $G_2$ are not the same

Figure 5: (a) A SP graph corresponding to Figure 4a. (b) A SP graph corresponding to Figure 4b.


3 Parallel Algorithm for 
Minimum Dual-Cover 


Utilizing the concepts discussed in Section 2, we will develop parallel algorithms for solving subproblems of minimum dual-covers. Then we integrate the algorithms to obtain a minimum dual-cover.

Here, we assume that a binary decomposition tree has been constructed (the construction of a BDT will be discussed in the next section). We aim to employ the divide-and-conquer principle for separating the SP graphs and the corresponding BDTs. The first procedure, TREE SEPARATION, decomposes a BDT into two sub-BDTs; thus the corresponding SP graph is separated into two subgraphs. The procedure TRAILHUNT combines two optimal dual-cover sets into one optimal dual-cover set. Each dual-cover set shows the optimal DETs of the corresponding SP graphs.


3.1 Tree Separation


In the procedure TREE SEPARATION, first we find a cut-edge and then separate the given BDT $T = (V_T, E_T)$. In the last step we delete the leaves no longer belonging to the vertices on the path from the cut-vertex up to the root. A


formal description of TREE SEPARATION is given below. 


Procedure TREE SEPARATION
begin
(1) pardo for all sub-BDTs $T_i$ at vertices $v_i$
    begin
      if $\frac{1}{3}|V_T| \le |V_{T_i}| \le \frac{2}{3}|V_T|$
        then $f_i$ := TRUE;
        else $f_i$ := FALSE;
    parend;
(2) select a cut-edge $e_c$ from all $v_i$ with $f_i$ = TRUE;
(3) separate the tree into two “equal-size” sub-BDTs by $e_c$;
(4) pardo for all tree vertices $v_i \in$ the path ($v_c \to$ the root);
      delete the leaves not belonging to $T_i$;
end;
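
Steps (1) and (2) of TREE SEPARATION can be simulated sequentially as follows. This is a sketch under stated assumptions rather than the parallel procedure itself: the BDT is encoded as nested tuples `(op, left, right)` with string leaves, a vertex is identified by its path from the root (0 for left, 1 for right), and subtree sizes are recomputed on the fly instead of being precomputed, one per processor, as the paper assumes. The helper names are hypothetical.

```python
def size(node):
    """Number of vertices in the sub-BDT rooted at node."""
    if isinstance(node, str):
        return 1
    return 1 + size(node[1]) + size(node[2])


def vertices(node, path=()):
    """Yield (path, sub-BDT) for every vertex, root first."""
    yield path, node
    if not isinstance(node, str):
        yield from vertices(node[1], path + (0,))
        yield from vertices(node[2], path + (1,))


def find_cut_vertex(bdt):
    """Return (path, subtree size) of a cut-vertex, or None.

    A vertex is flagged when |V_T|/3 <= |V_Ti| <= 2|V_T|/3; the
    comparison is done in integers to avoid rounding issues.
    """
    n = size(bdt)
    flagged = [
        (path, size(sub))
        for path, sub in vertices(bdt)
        if path and n <= 3 * size(sub) <= 2 * n  # the root has no parent edge
    ]
    return flagged[0] if flagged else None


# Assumed example: 5 leaves, hence 2 * 5 - 1 = 9 BDT vertices.
tree = ("+", ("*", "a", "b"), ("*", "c", ("+", "d", "e")))
cut = find_cut_vertex(tree)
```

Removing the edge above the returned vertex leaves a sub-BDT of the reported size and a remainder of $|V_T|$ minus that size, both within the one-third to two-thirds range.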


Lemma 4: TREE SEPARATION runs in $O(1)$ time and uses $O(|V_T|)$ processors.

Proof: Consider the BDT $T = (V_T, E_T)$. Assume the tree path ($v_i \to$ the root) and the leaves under $v_i$ have been constructed, where $1 \le i \le |V_T|$. It is trivially seen that Step 1 can be done in constant time using $|V_T|$ processors. Steps 2 and 3 run in constant time, as well. The last step takes constant time, for it involves cutting off the leaves under the sub-BDT $T_c$ from the sub-BDT $T_i$ while $v_i \in$ the path ($v_c \to$ the root). Thus, we conclude that TREE SEPARATION runs in $O(1)$ time and uses $O(|V_T|)$ processors. □


The separation technique of [GM] can be used to separate the SP graph, too. But their algorithm, devised for arbitrary planar graphs, runs in $O(\log^2 |V_T|)$ time and uses $O(|V_T|^{1+\epsilon})$ processors, $\epsilon > 0$. In series-parallel graph applications, our algorithm TREE SEPARATION therefore has better time and processor bounds than the algorithm in [GM].


3.2 Trailhunt


Consider a SP graph $G = (V, E)$ and its BDT $T = (V_T, E_T)$ with $|V_T| = 2|E| - 1$. When $G$ is separated into $|E|$ single-edge subgraphs, $T$ is decomposed into $|E|$ single-leaf sub-BDTs. TRAILHUNT recursively combines two pairs of subgraphs and generates all possible concatenations from two optimal dual-cover sets. An optimal dual-cover covering the new graph is thus obtained by applying TRAILHUNT recursively.

Consider two subgraphs $G_1$ and $G_2$ and their optimal dual-cover sets $D_1$ and $D_2$. A dual-cover $D_1(i) \in D_1$ with $\delta_1(i) \in \{A_2, A_3, A_4\}$ might be compatible with $D_2(k) \in D_2$ having $\delta_2(k) \notin A_0$ while a single-DET $D_1(j) \in D_1$ with $\delta_1(j) \in A_1$ is incompatible with $D_2(k)$. Consequently, besides keeping the single-DET dual-covers we need to keep all the possible dual-covers $D_1(i)$ that satisfy Lemmas 2 and 3 in a dual-cover set $D_1$.

For an optimal dual-cover set $D_3 = D_1 \# D_2$, we define a function COMBINE($D_1(i)$, $D_2(j)$) for obtaining a combination of $D_1(i)$ and $D_2(j)$, where $1 \le i \le |D_1|$ and $1 \le j \le |D_2|$. Consider two distinguished DETs $t_1$ with $L(t_1) = (e_1, e_2, \ldots, e_n)_{(v_1, v_1^d) \to (v_{n+1}, v_{n+1}^d)}$ and $t_2$ with $L(t_2) = (e'_1, e'_2, \ldots, e'_m)_{(v'_1, v'^d_1) \to (v'_{m+1}, v'^d_{m+1})}$. A trail-match step checks the boundary and d-boundary vertices pairs of $t_1$ and $t_2$: $(v_1, v_1^d)$, $(v_{n+1}, v_{n+1}^d)$, $(v'_1, v'^d_1)$, and $(v'_{m+1}, v'^d_{m+1})$. $t_1$ and $t_2$ are concatenated into one DET $t$ if they match each other at the boundary vertices and the d-boundary vertices, that is, $(v_1, v_1^d) = (v'_1, v'^d_1)$, $(v_1, v_1^d) = (v'_{m+1}, v'^d_{m+1})$, $(v_{n+1}, v_{n+1}^d) = (v'_1, v'^d_1)$, or $(v_{n+1}, v_{n+1}^d) = (v'_{m+1}, v'^d_{m+1})$. Otherwise, they are incompatible.

After a COMBINE step, let $d$ be the resulting dual-cover. In order for $D_3$ to be an optimal set, every dual-cover in $D_3$ needs to satisfy Lemmas 2 and 3. If any dual-cover $D_3(k) \in D_3$ has $\delta_3(k) = \delta(d)$, we choose the one with fewer internal DETs and discard the other non-optimal dual-cover. If no such dual-cover $D_3(k)$ exists then $d$ is included in $D_3$.


In TRAILHUNT, first, COMBINE($D_1(i)$, $D_2(j)$) sequentially matches two distinguished DETs from $D_1$ and $D_2$ to generate a new dual-cover $d$. Then, $d$ is checked against the restrictions imposed by Lemmas 2 and 3. If $d$ satisfies the conditions then $D_3 = D_3 \cup \{d\}$; otherwise $d$ is discarded. Therefore, the optimality of $D_3$ is ensured. Now, we give a formal description of the TRAILHUNT algorithm.


Procedure TRAILHUNT($D_1$, $D_2$)
begin
  for $i$ := 1 to $|D_1|$
    for $j$ := 1 to $|D_2|$
    begin
(1)   $d$ := COMBINE($D_1(i)$, $D_2(j)$);
(2)   if $\delta(d) = \delta_3(k)$, $1 \le k \le |D_3|$ and
        |internal DETs of $d$| < |internal DETs of $D_3(k)$|
      then begin
(2.1)   $D_3 := D_3 - \{D_3(k)\}$;
(2.2)   $D_3 := D_3 \cup \{d\}$;
      end
      else if $\delta(d) \ne \delta_3(k)$, $\forall k$, and $1 \le k \le |D_3|$
(2.3)   then $D_3 := D_3 \cup \{d\}$;
    end
end;
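
The update rule of TRAILHUNT (keep, for each dual-cover type, the candidate with the fewest internal DETs) can be sketched as below. The sketch makes several assumptions that are not in the paper: a dual-cover is reduced to a record with a hashable type and an internal-DET count, COMBINE is passed in as a function, and `toy_combine` is a made-up stand-in used only to exercise the loop.

```python
from typing import Callable, Dict, Hashable, List, NamedTuple, Tuple


class DualCover(NamedTuple):
    cover_type: Hashable      # plays the role of delta(d)
    internal: int             # number of internal DETs
    trails: Tuple[str, ...]   # opaque payload for this sketch


def trailhunt(
    d1: List[DualCover],
    d2: List[DualCover],
    combine: Callable[[DualCover, DualCover], DualCover],
) -> List[DualCover]:
    """Combine every pair and keep one best dual-cover per type."""
    d3: Dict[Hashable, DualCover] = {}
    for a in d1:
        for b in d2:
            d = combine(a, b)                       # step (1)
            kept = d3.get(d.cover_type)
            # Steps (2.1)-(2.2) replace a worse cover of the same type;
            # step (2.3) adds d when its type is not yet present.
            if kept is None or d.internal < kept.internal:
                d3[d.cover_type] = d
    return list(d3.values())


def toy_combine(a: DualCover, b: DualCover) -> DualCover:
    """Hypothetical stand-in for COMBINE, for demonstration only."""
    return DualCover(
        min(a.cover_type, b.cover_type),
        a.internal + b.internal,
        a.trails + b.trails,
    )


d1 = [DualCover("x", 0, ("p",)), DualCover("y", 2, ("q",))]
d2 = [DualCover("x", 1, ("r",)), DualCover("z", 0, ("s",))]
d3 = trailhunt(d1, d2, toy_combine)
```

Because the number of dual-cover types is bounded by a constant, the dictionary never grows beyond a constant size, which is what makes each TRAILHUNT call constant-time in the analysis that follows.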


Lemma 5: TRAILHUNT runs in $O(1)$ time and uses one processor.

Proof: The size of a dual-cover set, as proved in Theorem 3, is at most 42. Therefore, the number of iterations is $|D_1| \times |D_2| \le 42 \times 42 = 1764$. Step 1 performs at most $4 \times 4 = 16$ trail-matches, because in Theorem 2 it was proved that there are at most 4 distinguished DETs in a dual-cover. Step 2 clearly takes constant time, for $|D_3| \le 42$. Steps 2.1 to 2.3 each take constant time, as well. Thus the total running time is $O(42 \times 42 \times (16 + 42)) = O(1)$. We conclude that TRAILHUNT takes $O(1)$ time and employs one processor. □


Example: Consider Figure 6. When the BDT in Figure 6a is separated into two sub-BDTs, the SP graphs in Figures 1a and 1b are each split into two subgraphs as shown in Figures 6b and 6c, respectively. Now, we take one possible dual-cover of $(G_1, G_1^d)$, $D_1(i) = \{ (b,a)_{(2,7) \to (1,7)}, (g,h)_{(1,10) \to (2,10)} \}$, and one possible dual-cover of $(G_2, G_2^d)$, $D_2(j) = \{ (c,d)_{(1,7) \to (1,10)}, (e,f)_{(2,7) \to (2,10)} \}$, and combine them. As the SP graphs are described in DET forms, the trail-match steps are independent of whether the combination of two dual-covers is in series or in parallel. It is obvious that the boundary and d-boundary vertices pairs of $D_1(i)$ and $D_2(j)$ are matched: (1,7) of (a,b) and (1,7) of (c,d), (1,10) of (g,h) and (1,10) of (c,d), (2,10) of (g,h) and (2,10) of (e,f), and (2,7) of (b,a) and (2,7) of (e,f) are matched. A new dual-cover can be concatenated, for example, as $(a,b,e,f,h,g,d,c)_{(1,7) \to (1,7)}$ (see Figure 6d) or $(b,a,c,d,g,h,f,e)_{(2,7) \to (2,7)}$.
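
The trail-match test and the concatenation in this example can be replayed with a few lines of code. The following is a minimal sketch, assuming a DET is stored as its label sequence together with its starting and terminating (boundary, d-boundary) vertex pairs; the class `DET` and the function `concatenate` are illustrative names, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Pair = Tuple[int, int]  # (boundary vertex, d-boundary vertex)


@dataclass(frozen=True)
class DET:
    labels: Tuple[str, ...]
    start: Pair
    end: Pair

    def reverse(self) -> "DET":
        """A DET and its reverse are equivalent."""
        return DET(self.labels[::-1], self.end, self.start)


def concatenate(t1: DET, t2: DET) -> Optional[DET]:
    """Join two DETs if one of the four end-pair combinations matches."""
    if set(t1.labels) & set(t2.labels):
        return None  # the labels must be disjoint
    for a in (t1, t1.reverse()):
        for b in (t2, t2.reverse()):
            if a.end == b.start:
                return DET(a.labels + b.labels, a.start, b.end)
    return None  # incompatible


# The four DETs of the example, with the vertex pairs given in the text.
ba = DET(("b", "a"), (2, 7), (1, 7))
gh = DET(("g", "h"), (1, 10), (2, 10))
cd = DET(("c", "d"), (1, 7), (1, 10))
ef = DET(("e", "f"), (2, 7), (2, 10))

result = ba
for nxt in (cd, gh, ef):
    result = concatenate(result, nxt)
```

Joining the DETs in the order (b,a), (c,d), (g,h), (e,f) reproduces the second concatenation given in the example, a closed DET that starts and terminates at the pair (2,7).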

The following theorem is readily established by virtue 
of Lemmas 2 and 3. 


Theorem 5 : Two dual-covers can be optimally combined 
in O(1) time using one processor. 
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) The BDT of Figures la and 1b. 
. ) The separated SP subgraphs (G,G?) and 
(G,, G2) of Figure(1a, 1b). 
(d) One optimal dual-cover of Figures (la, 1b). 


3.3. Optimal Dual-Cover 


We have derived a technique (TREE SEPARATION) for
partitioning a pair of SP graphs in parallel. After log |E|
iterations of TREE SEPARATION, the graphs G = (V, E)
and G^d = (V^d, E^d) are partitioned into |E| pairs of
single-edge SP subgraphs. These subgraphs are then combined
in parallel; after log |E| iterations of TRAILHUNT,
the optimal dual-covers with minimum cardinality are generated.
In TREE SEPARATION and TRAILHUNT, the parallel
separations and combinations are independent of the
types of operations (series or parallel) in the corresponding
SP graphs; only the terminal vertices of the SP graphs are of
concern.

The algorithm OPTIMAL DUAL-COVER separates a
BDT T = (N_T, L_T) and the respective SP graphs G =
(V, E) and G^d = (V^d, E^d) into |E| sub-BDTs and |E| pairs
of subgraphs. Next, it combines the subgraphs to get the
desired optimal dual-covers of (G, G^d). A one-edge dual-cover
set is initialized as $D = \{ (e)_{(v_t, v_t^d) \to (v_s, v_s^d)} \}$.


Procedure OPTIMAL DUAL-COVER(T)
begin
    pardo for all active processors, each associated
          with a sub-BDT T_i
    begin
(1)     TREE SEPARATION;
(2)     push terminal-matrix of T_i;
(3)     set two new terminal-matrices of sub-BDTs T_i1 and T_i2;
(4)     activate an available processor to perform the T_i2;
        pardo for T_ik and k ∈ {1,2}
            if |V_T_ik| > 1
(5)         then OPTIMAL DUAL-COVER(T_ik)
(6)         else initialize the dual-cover set of T_ik;
        parend;
(7)     pop terminal-matrix of T_i;
(8)     TRAILHUNT(D_i1, D_i2);
(9)     release a processor;
    end;
end;


Lemma 6: OPTIMAL DUAL-COVER runs in O(log |E|)
time and uses O(|E|) processors, with |E| being the number
of edges of the input SP graph.

Proof: With an input pair of SP graphs G = (V, E) and
G^d = (V^d, E^d), employing |E| processors, the parallel
algorithm OPTIMAL DUAL-COVER takes O(log |E|) iterations
of TREE SEPARATION to get |E| single-edge
subgraphs. An additional O(log |E|) iterations of TRAILHUNT
are required to combine these subgraphs to obtain
the resulting dual-cover set. Step 5 is performed recursively.
Steps 1 and 8 each take constant time, as proved in Lemma
4 and Lemma 5, respectively. Steps 2 and 7 require constant
time to access the shared memory. Step 3 utilizes the
concepts introduced in Section 2.3 and is done in constant
time. Steps 4 and 9 need constant time to acknowledge the
processors, and Step 6 clearly requires constant time for
initializations. We conclude that OPTIMAL DUAL-COVER
runs in O(log |E|) time and uses O(|E|) processors. □


4 Optimal Layout Of CMOS 
Functional Cells 


In this section, we apply the algorithm OPTIMAL
DUAL-COVER proposed in Section 3 to obtain optimal layouts
of CMOS functional cells. We consider the UV style [UV]
layout of CMOS functional cells. As is customary, we assume
that the p-part and the n-part interconnections are
series-parallel graphs with fixed topologies.


4.1 Graph Models Of CMOS Circuits


Consider a pair of SP graphs (G, G^d) representing a
CMOS circuit. Let G represent the n-part of the CMOS circuit
and G^d the p-part. For instance,
Figure 1a represents the NMOS transistors of Figure 7b and
Figure 1b represents the PMOS transistors of Figure 7b.


4.2 Graph and Tree Transformations 
From Boolean Expressions 


We assume the input is a Boolean expression representing
the NMOS interconnections. In order to apply the
OPTIMAL DUAL-COVER algorithm proposed in Section
3, the Boolean function is transformed into a pair of
series-parallel graphs (G_n, G_p) with G_n = G and G_p = G^d, and
the corresponding binary decomposition tree T is to be obtained.
We find postfix notation most convenient for input
representation.

Figure 7: (a) A logic diagram with Z = (a * b) + ((c + d) * (e + f)) + (g * h).
(b) The CMOS circuit. (c) The optimal layout.

From the Boolean expression, we define G_n = (V_n, E_n)
representing the n-part, G_p = (V_p, E_p) representing the
p-part, and T = (V_T, E_T) representing the BDT of the corresponding SP
graphs (G_n, G_p). The number of transistors in the p-part and in the
n-part is |E_n| = |E_p| = (|V_T| + 1)/2. Each vertex in (G_n, G_p)
dictates the interconnection of sources and drains of a subset
of transistors.

In BDT TRANSFORMATION, we define NODE[l] to
be the lth input symbol, OP[i] to be the ith operator, and
VAR[j] to be the jth variable. The two child-vertices of OP[i]
can be VAR[j−1] and VAR[j], VAR[j] and OP[i−1], or
OP[i−1] and OP[i−k−2], where one of the child-vertices
dominates k operators. Each VAR[j] is a leaf of the BDT. In
BDT TRANSFORMATION, we first scan the input string
representing a Boolean function and store the symbols in
OP[i] or in VAR[j] as appropriate. Then each symbol links
to its parent OP[p], with p > i and p ≥ j − 1, and points to
its two child-vertices if the symbol is an operator.
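
To make the tree construction concrete, here is a minimal sequential sketch that builds a binary decomposition tree from a postfix expression with a stack. It is not the parallel BDT TRANSFORMATION: the `Node` class, the function names, and the use of `*` for series and `+` for parallel composition over single-letter variables are assumptions made for illustration. Since a binary tree with L leaves has 2L − 1 vertices, the leaf (transistor) count can be checked against the vertex count.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A BDT vertex: an operator with two children, or a variable leaf."""
    symbol: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_bdt(postfix: str) -> Node:
    # Standard stack evaluation of postfix: operators pop two subtrees.
    stack: List[Node] = []
    for ch in postfix:
        if ch in "*+":
            if len(stack) < 2:
                raise ValueError(f"operator {ch!r} is missing an operand")
            right = stack.pop()
            left = stack.pop()
            stack.append(Node(ch, left, right))
        elif ch.isalpha():
            stack.append(Node(ch))
        else:
            raise ValueError(f"unexpected symbol {ch!r}")
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


def count_vertices_and_leaves(node: Node) -> Tuple[int, int]:
    # Returns (number of vertices, number of leaves) of the subtree.
    if node.left is None and node.right is None:
        return 1, 1
    lv, ll = count_vertices_and_leaves(node.left)
    rv, rl = count_vertices_and_leaves(node.right)
    return 1 + lv + rv, ll + rl


# Postfix form of (a * b) + ((c + d) * (e + f)) + (g * h).
root = build_bdt("ab*cd+ef+*+gh*+")
vertices, leaves = count_vertices_and_leaves(root)
```

For this expression the tree has eight leaves, one per transistor in each of the n-part and the p-part.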


Lemma 7: BDT TRANSFORMATION runs in O(log n)
time and uses O(n) processors, where n is the number of CMOS gates.


The algorithm GRAPH TRANSFORMATION constructs
a pair of SP graphs (G, G^d) by assigning a terminal-matrix
to each BDT vertex; it is applied after the tree structures
have been established. In GRAPH TRANSFORMATION,
M_T, M_l, and M_r are defined as the terminal-matrices
of a BDT T, of its left sub-BDT, and of its right
sub-BDT, respectively.

Lemma 8: GRAPH TRANSFORMATION runs in O(1)
time and uses O(n) processors, where n is the number of CMOS gates.


By virtue of Lemmas 7 and 8, we conclude : 


Theorem 4: A pair of SP graphs (G_n, G_p) and the
corresponding BDT T are established in O(log |E|) time using
O(|E|) processors from an input Boolean expression, where
|E| is the number of CMOS gates.


Based on Theorem 4 and Lemma 6, OPTIMAL LAYOUT
is used for obtaining the dual-cover set of the SP graphs
(G, G^d). The dual-covers with minimum cardinality of DETs
minimize the CMOS layout area.


Procedure GRAPH TRANSFORMATION
begin
    set the terminal-matrix of T_root to be M_root = [1 2; 3 4];
    pardo for all T_OP with matrix M_OP = [a b; c d]
    begin
        ... then M_OPl = [ ... ], M_OPr = [ ... ];
    parend;
end;


Procedure OPTIMAL LAYOUT
begin
(1) BDT TRANSFORMATION;
(2) GRAPH TRANSFORMATION;
(3) OPTIMAL DUAL-COVER;
(4) output the dual-covers with minimum cardinality of DETs;
end;


Lemma 9: OPTIMAL LAYOUT runs in O(log n) time
and uses O(n) processors, where n is the number of CMOS gates.


Figure 7c has a different gate permutation from the layout
in [MH] and uses one metal track fewer than the 6 tracks
of [MH], which leads to a smaller area. The optimal
dual-cover is therefore not unique. In fact, some are preferred to
others, and an arbitrary dual-cover may require a “large”
number of tracks [5]. From Lemma 9, the following Theorem 6
is readily established.


Theorem 6: An optimal UV style layout of a CMOS
functional cell is obtained in O(log n) time using O(n) processors
in the PRAM model, where n is the number of CMOS gates.


4.3. Experimental Results 


The divide-and-conquer algorithm outlined in this paper
has been implemented in the C programming language on
VAX/UNIX BSD 4.3, and the output is displayed on a SILICON
GRAPHICS IRIS 2400 workstation. The bottleneck in the
running time of this simulation program is TRAILHUNT.
Therefore, we use one processor (a VAX machine) to approximate
the longest TRAILHUNT running time in OPTIMAL
DUAL-COVER as a time unit, then multiply it by log n
to obtain the OPTIMAL DUAL-COVER running time shown in
Figure 8. We also use the algorithm of [GLL], which runs in
O(n log n) time using one processor (RAM model), in our
simulation program to compact the layout, that is, to reduce
the layout height.

Figure 8: The OPTIMAL LAYOUT running time using
n processors (n is the number of gates).


LOOKAHEAD IN PARALLEL DISCRETE EVENT SIMULATION


Richard M. Fujimoto¹
Computer Science Department 
University of Utah 
Salt Lake City, UT 84112 


Abstract 


Empirical performance evaluations of parallel, discrete event simulation
algorithms using deadlock avoidance and deadlock detection and recovery
techniques developed by Chandy and Misra have been performed using the
BBN Butterfly™ multiprocessor. Experiments using synthetic workloads
reveal that the degree to which processes can look ahead in simulated time
plays a critical role in the performance of distributed simulators using
these algorithms. These results are applied to a queueing network
simulation where as much as an order of magnitude improvement in
performance is observed if the distributed simulator is programmed to fully
exploit the lookahead available in the application.
Measurements of several hypercube-based communication network
simulators provide additional empirical data to support these claims.
These results demonstrate that substantial improvements in performance
are obtainable if the application can be programmed to have good
lookahead characteristics. On the other hand, other applications inherently
contain poor lookahead properties, and appear to be ill-suited for these
simulation algorithms.


1. Introduction 


Discrete event simulation has long been a task with computation 
requirements that challenge the fastest available computers. For example, 
simulations of communication networks, parallel computer architectures, 
and battlefield scenarios often require hours, days, or even weeks of CPU 
time using traditional, single processor techniques. Simulator performance 
may be improved using vectorizing techniques [Chan83a], processors 
dedicated to specific simulation functions [Comf84a], execution of 
independent trials on separate processors [Bile85a], or the execution of a 
single instance of a simulation program on a parallel computer. The last 
technique, referred to as distributed simulation, is the subject of this paper. 


Simulation would initially appear to be a natural candidate for parallel 
processing because many of the aforementioned applications contain a 
high degree of parallelism. However, the exploitation of this parallelism is 
elusive because the global notion of simulated time does not easily map 
onto a distributed computer. This property distinguishes distributed 
simulation from other forms of parallel computation. 


Several schemes have been proposed to solve this problem. A survey of 
the literature has been reported by Kaudel [Kaud87a]. One important class 
of distributed simulation algorithms is the so-called ‘‘conservative’’ 
mechanisms. Chandy and Misra developed a mechanism based on a 
deadlock avoidance technique where null messages are used to distribute 
clock information among the processes taking part in the simulation 
[Chan79a, Misr86a]. Another mechanism, also developed by Chandy and
Misra, is based on a deadlock detection and recovery paradigm: the
simulator runs until deadlock, the deadlock is detected, and an algorithm is
executed to break the deadlock [Chan81a, Misr86a]. Other approaches to
distributed simulation have been proposed, notably the Time Warp
approach proposed by Jefferson [Jeff85a], but the work discussed here will
be confined to deadlock avoidance and deadlock detection and recovery
techniques.


In [Fuji88a] several experiments using synthetic workloads were 
described that were designed to evaluate the effectiveness of distributed 
simulation strategies using the deadlock avoidance and the deadlock 
detection and recovery algorithms. These experiments were performed on 
a distributed simulation testbed that was implemented on the BBN
Butterfly™, a shared-memory multiprocessor. Here, we apply these
results to specific application problems to provide empirical data to support 
these results. In particular, parallel simulations of queueing networks and 
the communication subsystem of a hypercube-based multicomputer 
demonstrate the relationship between lookahead in the simulation 
application and performance of the parallel simulator. 


¹This work was supported by ONR contract number N00014-87-K-0184 and NSF grant
number DCR-8504826.


2. Logical Processes, Activities, and Lookahead


Logical processes, activities, and lookahead form the basis for the 
synthetic workload model that is used here. The simulation program 
consists of some number of logical processes, each of which models some 
portion of the system being simulated. For example, in simulating a digital 
logic network, each gate (or some collection of gates) could be modeled by 
a logical process. Logical processes communicate exclusively by 
exchanging timestamped messages. Messages typically correspond to 
events that trigger a change in system state. Each logical process must 
process incoming messages in non-decreasing timestamp order to ensure 
that cause-and-effect relationships are faithfully reproduced by the 
simulator. 


We informally define an activity as a sequence or thread of events that
propagates among the logical processes in the simulation. These events
model some sequence of cause-and-effect relationships in the system being
simulated. For example, in a logic simulation, individual events are logic
signal transitions and each activity corresponds to a signal propagating
through a sequence of logic gates. In a queueing network simulation, each
activity corresponds to a job traveling through the network. Activities are
usually dynamic. A new activity is created in the logic simulation
whenever an existing activity reaches a fanout point in the network. The
activity disappears when (for instance) it reaches an AND gate with a logic
zero on one of the other input lines. For our purposes, this informal
definition of activities and logical processes will suffice.


Logical processes often ‘‘look ahead’’ into the simulated time future to 
schedule new events. For example, upon receiving a signal transition 
event in a logical process for an inverter gate, the process can predict and 
schedule a new event (a signal transition at the output of the gate) one gate 
delay later in simulated time. The lookahead abilities of the process 
determine how readily it will schedule new events. Processes such as the 
inverter with good lookahead abilities can ‘‘see’’ sufficiently far into the 
future that ‘‘effect’’ events can be scheduled as soon as the ‘‘cause’’ event 
is received. On the other hand, processes with poor lookahead ability must 
first wait until simulated time is advanced before they can schedule the 
effect event. For example, in a queueing network simulation with 
prioritized jobs, the “‘departure’’ event for a low priority job cannot be 
scheduled until it is first determined that no higher priority job will 
preempt it. 


Quantitatively, lookahead is defined as follows: if a process has 
knowledge of all events that will occur up to simulated time T, and can 
predict all new events it will generate with timestamp T + L or less, then
the process is said to have lookahead L. In general, lookahead is a 
complex function that varies with time and the type of event, and is highly 
dependent on details of the simulation problem and the way it is 
programmed. A process can schedule a future event so long as the 
timestamp on that event is less than or equal to the process’s local clock 
plus its lookahead. Such events are said to be within the ‘‘lookahead 
horizon’’ of the process. 


Consider a ‘‘cause’’ event with timestamp $T_{cause}$ that leads to an
‘‘effect’’ event with timestamp $T_{effect}$. The absolute value of lookahead is
not as important as the lookahead relative to $T_{effect} - T_{cause}$, because this
will determine how far the process must advance in simulated time to
generate the new event. Therefore, we define a quantity referred to as the
lookahead ratio (LAR):

$$\mathrm{LAR} = \frac{T_{effect} - T_{cause}}{\text{lookahead}}$$

A low (e.g., 1.0) LAR corresponds to a high degree of lookahead.
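
The two definitions above, the lookahead horizon and the lookahead ratio, can be written down directly. The following is a small sketch; the function names are hypothetical, and treating a lookahead of zero as an infinite LAR is an assumption made here, not something the text specifies.

```python
import math


def within_lookahead_horizon(local_clock: float, lookahead: float,
                             timestamp: float) -> bool:
    # An event can be scheduled if its timestamp does not exceed
    # the local clock plus the lookahead.
    return timestamp <= local_clock + lookahead


def lookahead_ratio(t_cause: float, t_effect: float, lookahead: float) -> float:
    # LAR = (T_effect - T_cause) / lookahead.
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if lookahead == 0:
        return math.inf  # assumption: no lookahead means an unbounded ratio
    return (t_effect - t_cause) / lookahead
```

For example, a process at local clock 10 with lookahead 2 can schedule an event at time 12 but not one at time 12.5, and a cause at time 10 with its effect at time 12 gives a LAR of 1.0 for that lookahead.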


3. The Distributed Simulation Testbed 


An 18 processor BBN Butterfly multiprocessor was used for
experimentation. Each processor node contains a 16 MHz MC68020 with
MC68881 floating point coprocessor, 1 to 4 MBytes of memory, and a
processor node controller (PNC), a microcoded engine that processes local
and remote memory requests. The interconnection switch is configured as
an Omega network. Atomic test-and-set like memory operations are also
implemented in the PNC. Execution times of various instructions and
operations are shown in table 1. Experimental data indicate that switch
contention, and hot spot congestion in particular, is unlikely [Thom86a].


Table 1. Hardware Parameters

Operation                           Execution Time (microseconds)
Local memory reference
Remote memory reference
Register-to-register instruction
16 bit Load (Local Memory)
16 bit Load (Remote Memory)
Parameterless function call
Atomic inclusive OR


Each processor executes a single operating system process. This 
process is a scheduler that time multiplexes execution of the simulation 
processes mapped to the processor. This strategy avoids excessive context 
switching overhead, and allows more direct control over the process 
scheduling mechanism. Asynchronous message passing primitives were 
constructed using direct memory accesses to the mailbox in the receiving 
simulator process. Only a few simple Butterfly primitives, namely lock 
and atomic-add operations, are used by the testbed after initialization is 
complete. 


4. The Simulation Algorithms 


Two distributed simulation algorithms were implemented in the testbed: 
one based on deadlock avoidance and another based on deadlock detection 
and recovery. The shared memory architecture of the Butterfly was used 
to improve the efficiency of these algorithms, as described below. A single 
processor, event list implementation was also developed in order to 
compute speedup. 


4.1 Deadlock Avoidance Strategy 


The deadlock avoidance scheme developed by Chandy and Misra was 
implemented first. Each logical process sends a null message to each of its 
neighbors whenever it blocks. The timestamp on this message represents a 
lower bound of the timestamp on any message that will be sent to the 
receiver in the future. It is equal to the local clock value of the process 
plus the lookahead value because, by definition, the process cannot predict 
the occurrence (or non-occurrence) of events further into the future. 
Chandy and Misra have shown that this approach is sufficient to avoid 
deadlock [Chan79a]. 


In the testbed, one optimization was performed to streamline the 
processing of null messages. Rather than enqueueing each null message 
sent to another processor, a single variable is associated with each input 
link that contains the timestamp of the last null message that was received. 
This avoids unnecessary enqueue and dequeue operations and leads to 
more efficient memory utilization. 
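
The null message timestamp and the per-link variable described above can be sketched as follows. This is an illustration under stated assumptions rather than the testbed's code: the class `InputLink`, its method names, and the assumption that timestamps on a link arrive in non-decreasing order are all introduced here.

```python
from collections import deque


def null_message_timestamp(local_clock: float, lookahead: float) -> float:
    # Lower bound on the timestamp of any future message from this process.
    return local_clock + lookahead


class InputLink:
    """One input link of a logical process (hypothetical structure)."""

    def __init__(self) -> None:
        self.real_messages = deque()  # queued (timestamp, payload) pairs
        self.last_null_time = 0.0     # single variable replacing queued nulls

    def receive_real(self, timestamp: float, payload: object) -> None:
        self.real_messages.append((timestamp, payload))

    def receive_null(self, timestamp: float) -> None:
        # Overwrite instead of enqueueing: only the latest bound matters,
        # assuming timestamps on a link never decrease.
        self.last_null_time = timestamp

    def lower_bound(self) -> float:
        # With real messages queued, the head of the queue must be handled
        # first; otherwise the last null message bounds future arrivals.
        if self.real_messages:
            return self.real_messages[0][0]
        return self.last_null_time
```

Overwriting one variable keeps the cost of a null message constant in both time and space, which matches the motivation given for the optimization.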


4.2 Deadlock Detection and Recovery Strategy 


The second simulation approach is based on deadlock detection and 
recovery. The simulation runs until deadlock, the deadlock is detected, 
and an algorithm is initiated to break the deadlock [Chan81a]. A central 
controller is used to coordinate the deadlock recovery procedure. 


Deadlock in the testbed is easily detected by maintaining a global 
counter indicating the number of processes that are either scheduled or 
running. The system is deadlocked whenever the counter reaches zero and 
there is at least one process that has not yet terminated (otherwise, the 
computation has terminated). Each scheduler checks the deadlock counter
whenever it fails to find a process to run, and initiates a computation to 
break the deadlock if it finds the counter is zero. 
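
The counter test can be sketched as below. The class and method names are hypothetical, and a Python lock stands in for the Butterfly's atomic operations; this shows the deadlock condition only, not the testbed's scheduler.

```python
import threading


class DeadlockCounter:
    """Global count of processes that are scheduled or running (a sketch)."""

    def __init__(self, active: int, unterminated: int) -> None:
        self._lock = threading.Lock()
        self.active = active              # scheduled or running processes
        self.unterminated = unterminated  # processes that have not finished

    def process_blocked(self) -> None:
        with self._lock:
            self.active -= 1

    def process_scheduled(self) -> None:
        with self._lock:
            self.active += 1

    def process_terminated(self) -> None:
        with self._lock:
            self.active -= 1
            self.unterminated -= 1

    def is_deadlocked(self) -> bool:
        # Deadlock: nothing can run, yet some process has not terminated.
        with self._lock:
            return self.active == 0 and self.unterminated > 0
```

A scheduler that fails to find a runnable process would call `is_deadlocked()` and start recovery when it returns true.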


The deadlock recovery algorithm locates the message in the system with 
the smallest timestamp and arranges for it to be processed next. A 
distributed algorithm is used to perform this computation. A central 
controller is used to coordinate this activity. By convention, the scheduler 
executing on PE 0 acts as the controller. 
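
One way to picture the recovery computation is a two-level minimum: each scheduler reports the smallest pending timestamp on its processor, and the controller takes the smallest of those. The sketch below shows only that reduction, written sequentially; the function names and the `(timestamp, process_id)` representation are assumptions, and the actual distributed protocol is not described in enough detail here to reproduce.

```python
from typing import List, Optional, Tuple

Message = Tuple[float, int]  # (timestamp, destination process id)


def local_minimum(pending: List[Message]) -> Optional[Message]:
    # Smallest-timestamp pending message on one processor, if any.
    return min(pending) if pending else None


def global_minimum(per_processor: List[List[Message]]) -> Optional[Message]:
    # The controller combines the per-processor minima.
    candidates = [m for m in map(local_minimum, per_processor) if m is not None]
    return min(candidates) if candidates else None
```

The message returned by `global_minimum` is the one that would be arranged to be processed next.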


An alternative deadlock recovery algorithm was also implemented in
which messages are propagated throughout the system in order to restart as
many processes as possible. This algorithm is described in [Chan81a]. It
was found, however, that the additional time required to execute this
algorithm yielded a net loss in performance. The performance figures
reported here are based on the former deadlock recovery approach.


4.3 Uniprocessor Simulation Algorithm 


Finally, a single processor, event list simulator was developed to allow 
comparison of distributed simulation programs with sequential event list 
implementations. In order to obtain a fair comparison, the uniprocessor 
simulator was constructed by modifying the distributed simulator. Both 
implementations maintain the same overall structure, organization, 
programming style, and conventions. All code specific to parallel 
computation (e.g., synchronization locks) was eliminated. 


The event list was implemented as a splay tree [Slea85a]. Empirical 
evidence suggests that splay trees are among the fastest methods for 
implementing an event list [Jone86a]. An alternative implementation 
using a singly linked linear list was also developed. It was found that this 
implementation yielded performance comparable to the splay tree for small 
simulations but, as expected, ran much more slowly for the larger 
simulations. The splay tree implementation is used in all comparisons with 
uniprocessor simulations reported here. 


4.4 Performance Metrics 


Three metrics are defined to evaluate the performance of the distributed 
simulation programs: 


• Speedup. SU(n), the speedup using n processors, is defined as the
execution time of the single processor, event list implementation using a
splay tree divided by the execution time of the distributed simulation
program when n processors are used.


• Null Message Ratio. NMR is defined as the number of null messages
processed by the simulator using deadlock avoidance divided by the
number of real (non-null) messages processed. This measures the
overhead of the deadlock avoidance approach.


• Deadlock Ratio. DR is the number of messages processed by the
distributed simulator using deadlock detection and recovery, divided by
the number of deadlocks that occur. This figure measures the efficiency
of the deadlock detection and recovery algorithm.
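
The three metrics are simple ratios, sketched below with hypothetical function names. The numbers in the usage lines are made up for illustration and are not measurements from the experiments.

```python
def speedup(uniprocessor_time: float, parallel_time: float) -> float:
    # SU(n): splay-tree event list time divided by time on n processors.
    return uniprocessor_time / parallel_time


def null_message_ratio(null_messages: int, real_messages: int) -> float:
    # NMR: null messages processed per real (non-null) message processed.
    return null_messages / real_messages


def deadlock_ratio(messages_processed: int, deadlocks: int) -> float:
    # DR: messages processed per deadlock that occurs.
    return messages_processed / deadlocks


# Illustrative values only.
su = speedup(80.0, 10.0)
nmr = null_message_ratio(500, 2000)
dr = deadlock_ratio(9000, 30)
```

A higher SU and DR indicate better performance, while a higher NMR indicates more overhead.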


The single processor execution times were obtained by running the splay 
tree simulator on a single node of the Butterfly. The same compiler as that 
used by the distributed simulator was used. Therefore, compiler and 
processor speed dependencies are factored out of the speedup figures. 


The experiments were performed with no other applications running on 
the Butterfly. Facilities, such as the window manager, were run on 
processors different from those executing the simulation program. These 
measures were taken to minimize interference with the computation. 


Experimental data were, for the most part, well behaved. The 95 
percent confidence intervals for the measured data were typically less than
one or two percent of the reported value. Only in a few instances were
significant variations observed from one measurement to another. These 
were related to the avalanche effect described later, and do not affect the 
conclusions that follow from these experiments. 


5. Experiments Using Synthetic Workloads 


Synthetic workloads were constructed based on the notions of logical 
processes, activities, and lookahead, described earlier. Workloads 
contained 16 and 64 logical processes organized in 4 by 4 and 8 by 8 
toroids, respectively (a toroid is a nearest neighbor mesh with wrap-around 
edge connections). Toroids were used because they do not contain 
inherent bottlenecks that might color the results, and because they are rich 
in cycles, and therefore represent a reasonably challenging configuration 
for the simulation algorithms. It is assumed that the number of activities in 
the simulation remains constant, and the lookahead of each process 
remains fixed throughout the simulation and does not depend on the type 
of event. Within each experiment, a fixed number of messages (the 
message population) circulates in a manner similar to jobs traveling 
throughout a closed queueing network. Simulation activity in each process
was emulated using busy wait loops.
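
The toroid topology used for the workloads can be generated as follows. This is a sketch with a hypothetical function name and a row-major numbering of processes chosen for illustration.

```python
from typing import Dict, Set


def toroid_neighbors(rows: int, cols: int) -> Dict[int, Set[int]]:
    """Nearest neighbor mesh with wrap-around edges, nodes numbered row-major."""
    neighbors: Dict[int, Set[int]] = {}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            neighbors[node] = {
                ((r - 1) % rows) * cols + c,  # up, wrapping to the last row
                ((r + 1) % rows) * cols + c,  # down, wrapping to the first row
                r * cols + (c - 1) % cols,    # left, wrapping to the last column
                r * cols + (c + 1) % cols,    # right, wrapping to the first column
            }
    return neighbors


small = toroid_neighbors(4, 4)  # 16 logical processes
large = toroid_neighbors(8, 8)  # 64 logical processes
```

Every process has exactly four neighbors in both configurations, so no process is a structural bottleneck.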


The experiments discussed next assume a message population of four
messages per process and an average computation time of 1 millisecond
(selected from a random variable with a negative exponential distribution)
to process each incoming message. A static process to processor mapping
was used that balanced the workload assigned to the available processors
while minimizing interprocessor communications.

Figure 1. Speedup of synthetic workload as lookahead is varied.

Experiments were conducted to examine the effects of
computation granularity, dynamic load balancing, message population,
message routing, and other factors. A detailed description of these results
is beyond the scope of the present discussion, but is given elsewhere
[Fuji87a, Fuji88a]. We will summarize some of these results and discuss
how they can be applied to a specific application.


5.1 Effect of Lookahead 


The speedup curves in figure 1 show the effect of varying lookahead in 
the deadlock avoidance simulator. As can be seen, lookahead plays a 
critical role in determining simulator performance. Performance degrades
significantly as the lookahead ability of each process is reduced. Processes 
with poor lookahead characteristics must delay generating new events, 
reducing the amount of parallelism available in the simulation. 


Performance of the 16 node toroid is somewhat less than the 64 node 
toroid because the simulation does not contain sufficient parallelism to 
keep all of the processors busy. In addition, as the number of processes 
per processor is decreased, each process is afforded less time to collect 
messages before it is executed by the scheduler. As a result, a process may 
be scheduled more often than if there were more processes mapped to the 
processor. The additional scheduling overhead and increased idle time 
lead to poorer performance in the 16 node simulator, particularly as the 
number of processors is increased. 


5.2 Message Avalanche 


Experiments using the deadlock detection and recovery strategy also 
revealed an ‘‘avalanche’’ phenomenon. This behavior is depicted in figure 
2 where the deadlock ratio is plotted as a function of the message 
population. Performance remains poor (only a few messages processed 
between deadlocks) at low and moderate message populations, but then 
increases dramatically once message population reaches a certain critical 
level. It was found that message avalanche was a prerequisite for 
achieving good performance for this simulation strategy. 


Message avalanche occurs when a message arriving at a process causes 
the transmission of one or more additional messages, which in turn trigger 
the transmission of still others, and so on. A multiplicative effect occurs 
whereby an ‘‘avalanche’’ of message traffic results from the original, 
accounting for the dramatic improvement in simulator efficiency. 


As shown in figure 2, the message population required to induce
avalanche was found to be dependent on the lookahead ability of the
processes. Smaller populations were required to induce avalanche if
processes were able to see far into the simulated future. This is again
because poor lookahead characteristics reduce the amount of parallelism in
the simulator.

Figure 2. Message avalanche occurs as the message population is increased.


5.3 Processes with Different Lookaheads 


The experiments described above used homogeneous workloads where
each process behaved in the same way as the others. Many real
simulations contain a variety of logical processes with different lookahead
characteristics. Additional experiments were performed in which some
processes had poorer lookahead characteristics than the others.


Figures 3 and 4 show simulator overhead for the deadlock detection and 
recovery, and deadlock avoidance simulators, respectively, when some 
number of processes with poor lookahead characteristics are mixed with 
processes with good lookahead characteristics. Experiments were 
performed in which one, one fourth, one half, and finally all processes 
have poor lookahead (high LAR). Figure 3 indicates that the presence of a
few processes with poor lookahead results in a perceivable performance
degradation in the deadlock detection and recovery simulator (the
avalanche point is moved to higher message populations). When a 
significant fraction of the processes have poor lookahead, performance is 
almost the same as that when all processes have poor lookahead. The 
deadlock avoidance simulator was found not to be as susceptible to such 
behavior (see figure 4), though some degradation results if a sufficiently 
high fraction have poor lookahead properties. 


6. Queueing Network Simulations 


To illustrate the applicability of the above results in a specific 
application, queueing network simulations were performed. A five 
process, central server network was simulated on the testbed. As shown in 
figure 5, this network contains three first-come-first-serve (FCFS) 
processes that service incoming jobs in the order in which they arrive, a
fork process that stochastically routes each incoming job to one of its 
output ports (assume for now that either port is equally likely to be 
selected), and a merge process that combines streams of incoming jobs into 
a single output stream. Each server process also computes the average 
number of jobs in the server and reports this figure to the user. 

Simulation and empirical studies by Seethalakshmi and Reed 
respectively concluded that the central server network is ill-suited for the 
conservative distributed simulation algorithms discussed here 
[Seet79a, Reed88a]. We reproduce and explain the poor results that these 
researchers observed in terms of message population and lookahead, and 
utilize this knowledge to improve performance. 


The ‘‘classical’’ implementation of the FCFS process uses two types of 
events: arrival events (scheduled by other processes) denote jobs arriving 
at the server, and departure events (scheduled by the FCFS process itself) 
denote jobs completing service. The actions executed by the server 
process for each event type are shown in figure 6. NJobs indicates the 
number of jobs currently residing in the server, and ServiceTime indicates 
the time required to service each job. Code for computing statistics is not 
shown. 


The classical server process has very poor lookahead properties. This is
because it will not transmit an arrival event message with timestamp TS
until it has first advanced its local simulated time clock to TS by
processing a departure event. In effect, it has a lookahead value of zero.

Figure 3. Overhead with non-uniform lookahead (deadlock recovery).
Figure 4. Overhead with non-uniform lookahead (deadlock avoidance).


The lookahead properties of the FCFS process can be improved by 
eliminating the departure event, and generating a new arrival event as soon 
as one is received. Because an FCFS queueing discipline is used, the 
departure time can be determined as soon as the message is received. The 
optimized program is shown in figure 7. EndService denotes the time at 
which the server process will become idle if no additional jobs are 
received in the future. This program exhibits very good lookahead abilities 
because it can schedule events far into the simulated time future. 


6.1 Performance Using Identical Servers 


Simulators using each of these server programs were developed and 
executed on the Butterfly testbed. In all of the experiments described 
below, each logical process was mapped to a separate processor, and static 
scheduling was used. Service times for server processes were selected 
either deterministically or from a random variable with a negative 
exponential distribution. 


The resulting speedup and simulator efficiencies for the central server 
queueing model using the deadlock detection and recovery strategy are 
shown in figures 8 and 9, respectively. The deadlock avoidance simulator 
yielded similar speedups. As can be seen, reprogramming the server to 
have better lookahead characteristics dramatically improves performance. 
Speedup is improved by as much as an order of magnitude. These results 
are consistent with those obtained using synthetic workloads. 


The performance results of the classical server process are qualitatively 
similar to those reported by Reed and Seethalakshmi. The servers used in 
those studies are a variation of the classical server described above, and 
share the same (poor) lookahead properties: a message will not be 
forwarded until another message is first received with a timestamp at least 
as large as the departure time of the first. Therefore, lookahead provides 
an explanation for the poor performance that they observed. 


Figure 5. Central server queueing model. 


ARRIVAL EVENT at TIME T: 
    NJobs := NJobs + 1; 
    IF (NJobs = 1) THEN /* if server was previously idle */ 
        Schedule (local) Departure Event at time T + ServiceTime; 

DEPARTURE EVENT at TIME T: 
    Schedule (remote) Arrival Event at time T; 
    NJobs := NJobs - 1; 
    IF (NJobs > 0) THEN /* if job(s) waiting in queue */ 
        Schedule (local) Departure Event at time T + ServiceTime; 

Figure 6. ‘‘Classical’’ program for FCFS server (poor lookahead). 


ARRIVAL EVENT at TIME T: 
    IF (T < EndService) THEN /* if server busy */ 
    BEGIN 
        Schedule (remote) Arrival Event at time EndService + ServiceTime; 
        EndService := EndService + ServiceTime; 
    END 
    ELSE /* server idle */ 
    BEGIN 
        Schedule (remote) Arrival Event at time T + ServiceTime; 
        EndService := T + ServiceTime; 
    END 

Figure 7. Optimized program for FCFS server (good lookahead). 
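
The contrast between the programs of figures 6 and 7 can be illustrated 
with a minimal Python sketch. It is not the code run on the Butterfly 
testbed: the function names, the fixed service time, and the bookkeeping of 
a "send time" are assumptions made for illustration, and the classical 
server is written in an equivalent form that tracks the end of service 
rather than NJobs. Each function takes arrival timestamps in nondecreasing 
order and returns, for every outgoing arrival event, the local simulated 
time at which it can be sent and the timestamp it carries. 

```python
def classical_server(arrivals, service_time):
    """Figure 6 style: a job is forwarded only when its departure event fires."""
    out = []
    end_service = 0.0
    for t in arrivals:
        start = max(t, end_service)          # wait for the job ahead, if any
        end_service = start + service_time   # time of this job's departure event
        # The message is sent when the local clock reaches the departure time,
        # so send time and timestamp coincide.
        out.append((end_service, end_service))
    return out


def optimized_server(arrivals, service_time):
    """Figure 7 style: the departure time is computed as soon as the job arrives."""
    out = []
    end_service = 0.0
    for t in arrivals:
        if t < end_service:                  # server busy
            end_service += service_time
        else:                                # server idle
            end_service = t + service_time
        # The message is sent at local time t with a timestamp in the future.
        out.append((t, end_service))
    return out


def lookaheads(sent):
    """Timestamp minus local send time for each outgoing message."""
    return [timestamp - send_time for send_time, timestamp in sent]
```

Under these assumptions both servers emit identical timestamps, but the 
classical server always sends with zero lookahead, while the optimized 
server sends each message at least one service time ahead of its local 
clock. 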


Although the above results are encouraging, it is important to keep in 
mind that reprogramming the application to exhibit greater lookahead 
ability is not always possible. The above optimization relied on the servers 
using an FCFS scheduling discipline. As we shall soon see, many 
applications inherently contain poor lookahead properties. 


Finally we note that, at first glance, reprogramming logical processes to 
maximize lookahead may complicate other aspects of the simulation, e.g., 
statistics collection. For example, the optimized server does not pause for 
departure events, so statistics that are most easily collected at job departure 
must be collected at other points in simulated time. This problem is easily 
reconciled by scheduling local departure events (as was done before) that 
are only used for statistics collection purposes. 


6.2 Performance Using Mixed Servers 


Additional experiments were performed to examine the effect of mixing 
processes with poor and good lookahead characteristics. Recall that 
experiments using synthetic workloads revealed that a small number of 
processes with poor lookahead could significantly degrade performance of 
the deadlock detection and recovery simulator. The deadlock avoidance 
simulator was found not to be as susceptible to such behavior. 


The central server queueing network simulations were repeated with 
one of the three servers implemented using the classical server 
program described earlier, and the remaining servers using the optimized 
program. The resulting simulator is not unlike one that would result if one 
of the servers were (say) a prioritized queue while the others were FCFS. 


Figure 8. Speedup of central server queueing model. 
Figure 9. Overhead of central server queueing network simulator. 


The speedup and efficiency of the deadlock detection and recovery 
simulator are shown in figures 10 and 11. When the central server (the 
process receiving messages from the merge process) has poor lookahead 
properties, performance is almost as poor as when all of the servers have 
poor lookahead. When one of the secondary servers (the servers receiving 
messages from the fork process) has poor lookahead, performance is 
better, but still well below that of the simulator using only optimized 
servers. These results are consistent with those obtained using synthetic 
workloads, and demonstrate that a few processes with poor lookahead can 
significantly degrade overall performance in the deadlock detection and 
recovery simulator. 


When the classical program was used to implement a secondary server, 
the routing probabilities in the fork were modified so that 10, 50, and 
finally 90 percent of the message traffic was routed to the classical server. 
It is interesting to note that performance improves as more traffic is routed 
toward the server with poor lookahead. If little traffic is directed toward 
this server, the simulator deadlocks constantly: the merge process is 
forced to block because it cannot determine whether it is safe to 
proceed without first receiving a message from this server. Routing 
additional message traffic toward this server helps the simulator to 
overcome (somewhat) the server’s poor lookahead characteristics. 




[Plot: speedup of the central server network under deadlock detection and 
recovery, versus message population. Curves: classical central server, and 
classical secondary server receiving 10, 50, and 90 percent of the traffic, 
each with deterministic and exponential service times.] 
Figure 10. Speedup of detection and recovery simulator with one classical server. 


Speedup and overhead curves for the deadlock avoidance simulator are 
shown in figures 12 and 13. The deadlock avoidance simulator tends to be 
more forgiving of processes with poor lookahead. Poor performance 
results when the central server process has poor lookahead. However, 
performance begins to approach that of the optimized simulator in some 
situations where one of the secondary servers has poor lookahead. In 
particular, good performance is obtained if a significant fraction of the 
message traffic (50 to 90 percent) is routed around the process with poor 
lookahead. Unlike the deadlock detection and recovery simulator, null 
message traffic is generated by the classical server to allow the merge 
process to proceed. Because processes with poor lookahead tend to buffer 
messages rather than immediately forwarding them, it is best to minimize 
the amount of traffic routed to the classical server because this only 
detracts from the available parallelism. 


7. Communication Network Simulations 


Simulations of the message passing subsystem of a hypothetical 
multicomputer were also performed. The multicomputer is organized in a 
hypercube topology, and Sullivan’s algorithm is used to route messages to 
their respective destinations [Sull77a]. Like the queueing network and 
synthetic workload experiments, a fixed message population was used to 
control the amount of available parallelism. Initially, each message is 
assigned a destination to which it is to be routed, and a message length. 
The destination is selected from a uniform distribution (excluding the 
processor where the message initially resides), and the message length is 
selected from an exponential distribution. When a message reaches its 
final destination, a new destination and message length are selected. All 
communication links in the hypercube are assumed to provide the same 
bandwidth. Three simulators were developed that contain varying degrees 
of lookahead, as will be described next. 


Figure 12. Speedup of deadlock avoidance simulator with one classical server. 


7.1 A Simulator with High Lookahead 


FCFS is a simulator in which messages are simply forwarded on the 
output link selected by the routing algorithm in FCFS order. Like the 
FCFS queueing network described earlier, this simulator has great 
lookahead ability because messages arriving at a logical process (with 
timestamp denoting the arrival time in the hypercube) can be immediately 
forwarded. 


7.2 A Simulator with Moderate Lookahead 


PRIO is a simulator with intermediate lookahead properties. Here, 
messages are classified as either high priority or low priority. 
Communication links in the hypercube give preference to high priority 
messages when selecting the next message to be transmitted. A low 
priority message is only forwarded if there are no high priority messages 
waiting to use the link. Messages within each priority level are processed 
in FCFS order. Each message is assigned a new priority whenever a new 
destination address and message length are selected and maintains this 
priority until it reaches the destination processor. 


No preemption occurs in this simulator. Once the link begins 
forwarding a low priority message, it will continue to send it, even if a 
high priority message arrives before transmission is complete. 


The parallel simulator for this system has intermediate lookahead 
properties. Logical processes have excellent lookahead for high priority 
messages, but poorer lookahead for those with low priority. Just as is the 
case for the FCFS simulator, high priority messages can be forwarded as 
soon as they arrive because the departure time can be immediately 
determined. However, a low priority message cannot be forwarded until 
simulated time in the logical process has advanced to the departure time 
(the time the hypercube begins sending the message) because it must first 
be determined that no high priority message will receive service ahead of 
it. 


7.3 A Simulator with Poor Lookahead 


The third simulator, PREEMPT, is identical to the PRIO simulator 
except that high priority messages preempt service of low priority 
messages. When a low priority message is preempted, it is assumed that 
the message must be completely resent once no other high priority 
messages remain that are waiting to use the link. The simulator for this 
system cannot forward a message to another logical process until simulated 
time has advanced to the arrival time (the time the tail of the message 
reaches the receiving hypercube node), so it has even poorer lookahead 
properties than the preceding simulator. 


Figure 13. Overhead of deadlock avoidance simulator with one classical server. 


7.4 Performance Results 


The hypercube simulations were performed on the Butterfly, and 
compared with execution of the sequential event list implementation. 
Unlike the previous experiments, these were performed on the Butterfly 
Plus, an upgraded version of the Butterfly that features 32 bit data paths 
(the original Butterfly has 16 bit data paths). The switch remains the same, 
so this effectively increases the cost of interprocessor communications. 
Because the simulation testbed already minimizes interprocessor 
communication, no program modifications were required. Experiments 
indicated that this hardware modification did not significantly affect the 
speedup measures derived earlier. 


Overhead for these three simulators is shown in figures 14 and 15 for 
hypercubes of dimensions 4 and 6 (16 and 64 nodes, respectively). Eight 
processors were used in these experiments. Upon reaching its destination, 
each message is assigned a high priority with probability P_hiprio. In these 
experiments, P_hiprio was selected to be either 0.01 or 0.50. 


As predicted, the observed overhead steadily increases as the lookahead 
properties of the simulation are diminished. This is reflected in higher null 
message ratios in the deadlock avoidance simulator, and a larger message 
population required to induce avalanche in the detection and recovery 
simulator. Overheads are generally lower in the dimension four hypercube 
than the cube of dimension six for a fixed message population (as 
measured in messages per process) because there are fewer 
communication links; the simulators operate at peak efficiency when there 
is at least one message on each incoming link because no blocking occurs. 


The lookahead properties of the simulator increase as P_hiprio increases 
because more high priority messages are generated that can be forwarded 
as soon as they are received. This explains the lower overheads that were 
observed when P_hiprio was increased. 


Speedup curves for the hypercube simulators are shown in figures 16 
and 17. Using eight processors, the parallel simulator executed anywhere 
from 5.7 times faster to nearly 20 times slower than the splay tree 
simulator, depending on the lookahead properties of the application. Some 
data points for very high message populations are missing because 
insufficient memory was available on a single processor to conduct an 
event list simulation. 


The hypercube simulations provide additional evidence to support our 
contention that lookahead properties of the application are crucial to 
obtaining efficient performance for simulators using the deadlock 
avoidance and deadlock detection and recovery strategies. While the 
queueing network simulations demonstrated that it is possible to obtain 
dramatic speedups by reprogramming the simulation to fully exploit its 
lookahead properties, these experiments demonstrate that some simulations 
inherently contain poor lookahead, and cannot be improved by 
reprogramming. Such simulations appear to be poorly suited for the 
conservative simulation algorithms using deadlock avoidance and 
deadlock detection and recovery techniques, except in a few special 
circumstances such as networks that contain no feedback loops. 


Figure 14. Overhead in hypercube simulator using deadlock recovery. 
Figure 15. Overhead in hypercube simulator using deadlock avoidance. 


8. A Perspective on Lookahead: Non-Events 


The influence of lookahead on performance can be viewed from another 
perspective: processes with very good lookahead ability are able to act in a 
largely autonomous fashion; their behavior is not heavily influenced by the 
activities of other processes, so they can perform simulation work at ‘‘full 
speed,’’ limited only by the rate at which they can be fed work, and the 
number of CPU cycles (or other resources) that they can obtain. The 
optimized queueing network server process is a good example of such 
autonomous behavior. 


On the other hand, processes with poor lookahead ability must 
frequently obtain additional information from other processes before they 
can safely proceed. This is unfortunate because not only must such 
processes wait for real events to be generated by other processes 
(corresponding to data dependencies that cannot be circumvented), but 
often they must also wait to be sure other events will not occur. The fact 
that an airplane will not crash and close the airport in the next moment of 
simulated time must be discovered before the airport process can go about 
its business of deciding what will happen next. We call these ‘‘phantom’’ 
events that never materialize non-events. Chandy and Misra recently 
captured these notions in an elegant formalism called conditional and 
unconditional knowledge [Chan87a]. 


Figure 17. Speedup of hypercube simulator using deadlock avoidance. 


In the deadlock avoidance simulator, knowledge of non-events is passed 
explicitly through the use of null messages. In the deadlock detection and 
recovery simulator, this information is obtained by system deadlock — 
processes with messages waiting to be processed must wait until they can 
be certain that specific events will not occur. Certainty as to the 
eventuality of non-events comes about when the deadlock is broken, and 
the deadlock resolution protocol is invoked. Sequential, event list 
simulators incur little or no overhead for non-events. 
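
The way null messages convey knowledge of non-events can be sketched in a 
few lines of Python. This follows the standard Chandy and Misra 
formulation rather than the testbed code, and the class, method, and link 
names are hypothetical. Every message on a link, real or null, promises 
that nothing with a smaller timestamp will follow on that link, so a 
process may simulate up to the minimum of its input link clocks. 

```python
class ConservativeProcess:
    """Toy logical process that tracks what it knows will NOT happen."""

    def __init__(self, input_links, lookahead):
        self.link_clock = {name: 0.0 for name in input_links}
        self.lookahead = lookahead   # assumed constant for this sketch
        self.clock = 0.0

    def receive(self, link, timestamp):
        """Record a real or null message arriving on an input link."""
        self.link_clock[link] = max(self.link_clock[link], timestamp)

    def safe_time(self):
        """No event earlier than this can still arrive on any input link."""
        return min(self.link_clock.values())

    def advance(self):
        """Advance as far as is safe; return the timestamp for outgoing nulls."""
        self.clock = max(self.clock, self.safe_time())
        return self.clock + self.lookahead
```

With a lookahead of zero, the null timestamp returned by `advance` equals 
the local clock, so downstream processes learn nothing about the future; 
this is the situation in which the waiting described above dominates. 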


If non-events are possible, but occur infrequently, the simulator is often 
forced to wait needlessly, leading to very poor performance. The 
hypercube simulator containing preemption and few high priority messages 
is one example of such behavior. Optimistic simulation methods such as 
Time Warp appear to offer the greatest potential for addressing this 
problem, if the associated state saving and rollback overheads can be 
overcome. 


9. Conclusions 


Extensive empirical performance evaluations of distributed simulation 
programs were performed using the deadlock avoidance and deadlock 
detection and recovery algorithms developed by Chandy and Misra. The 
principal results of these studies are: 


• The lookahead ability of logical processes plays a critical role in 
determining the efficiency of the deadlock avoidance and deadlock 
detection and recovery algorithms. This is attributed to the fact that 
processes must spend an excessive amount of time waiting to be sure that 
certain events will not occur if their lookahead ability is poor. 


• Message avalanche was observed in the deadlock detection and recovery 
simulator for moderate to high message populations, and was necessary 
to achieve efficient execution. The poorer the lookahead ability of a 
process, the larger the message population necessary to achieve 
avalanche. If lookahead is sufficiently poor, avalanche may never be 
observed for workloads of practical interest. 


• Deadlock detection and recovery simulators containing different types of 
logical processes can be adversely affected by a small number of 
processes that exhibit poor lookahead ability. The existence of a few 
such processes can greatly increase the message population necessary to 
achieve avalanche, even if many other processes contain very good 
lookahead properties. The deadlock avoidance simulator is not as 
severely affected by this behavior if the bulk of the simulation activity 
avoids processes with poor lookahead. 


• Queueing networks that contain cycles, previously thought to be 
ill-suited for conservative distributed simulation algorithms, can achieve 
good performance if servers are reprogrammed to take advantage of all 
available lookahead. 


• Simulation applications such as those containing infrequent preemptive 
events inherently have poor lookahead properties, and appear ill-suited 
for these algorithms. Applications containing state dependent behavior 
(e.g., load balancing mechanisms) similarly contain moderate to poor 
lookahead properties. 


• Simulations of several hypercube-based communication networks with 
varying degrees of lookahead provide empirical data to support the above 
conclusions. 


These studies demonstrate that parallel simulation algorithms can 
achieve significant speedups over sequential event list implementations if a 
moderate to high degree of parallelism is present, even if there are many 
feedback loops in the logical process topology. However, good lookahead 
properties are essential to obtaining good performance in simulations using 
deadlock avoidance or deadlock detection techniques. The fact that a few 
processes with poor lookahead properties can significantly degrade 
performance also limits the usefulness of these approaches. 


Because conservative simulation algorithms must continually predict 
what will not happen in order to be able to safely proceed, these studies 
raise considerable doubt as to whether any conservative parallel simulation 
algorithm can obtain significant speedup in applications containing poor 
lookahead properties. In these situations, optimistic simulation algorithms 
such as Time Warp appear to offer much greater potential for achieving 
significant speedups. 


A BLOCKED JACOBI METHOD FOR THE 
SYMMETRIC EIGENPROBLEM 


Abstract: A block matrix generalization of the Jacobi rotation 
method for computing the eigendecomposition of a symmetric matrix 
is presented. This Blocked Classical Jacobi (BCJ) algorithm 
selects for block rotation at each step the off-diagonal block(s) of 
largest mass. The BCJ algorithm exhibits substantially shorter 
runtimes than other Jacobi-like methods, even though it performs 
more work per iteration. A probabilistic analysis of the BCJ 
selection method is presented. Timings and other data are presented 
from experiments on random matrices. 


1 Introduction 


The class of Jacobi rotation methods [4,7,10,12] for computing the 
symmetric eigenvalue decomposition 

$$A = U D U^T \tag{1}$$

of an $n \times n$ real matrix A, where U is orthogonal and D is diagonal, 
has generated substantial interest in recent years, particularly in 
the context of parallel computer architectures. Algorithms have 
been developed for systolic processor arrays as well as for more 
general purpose parallel computers. These methods differ principally 
from the original method of Jacobi in that they choose a 
fixed sequence of matrix elements for the necessary orthogonal 
rotations. Jacobi’s method performs a rotation to zero out the largest 
off-diagonal element at each step; the sequence of rotations is 
data-dependent. 

This paper presents a novel block matrix or “hypermatrix” adaptation 
[2,3,16] of the original algorithm, which we label the Blocked 
Classical Jacobi (BCJ) algorithm. The matrix A is treated as a 
smaller $m \times m$ matrix of $b \times b$ submatrices; computations work on 
entire submatrices rather than on scalars. Furthermore, the sequence 
of submatrices to be rotated is chosen to locally maximize the 
reduction of A to diagonal form by selecting the off-diagonal blocks 
of largest mass. BCJ reduces serial runtimes compared with other 
Jacobi methods and thus may prove useful where Jacobi methods 
are preferred over other eigensolvers, such as the QR algorithm. 

For computers with a hierarchical memory system, in which 
successively larger yet slower memories are located at increasing 
distances from the arithmetic processor, many numerical calculations 
are efficiently structured in terms of block algorithms [3,8,15]. 
Rather than computing with scalar quantities, block algorithms 
operate on small square or rectangular submatrices of data. The 
resulting “surface-to-volume” effect of a single block data transfer 
followed by several computations allows a fast processor with local 
memory to achieve nearly full utilization even when supplied by a 
significantly slower bus or main memory. 

The blocked organization of BCJ reduces the overhead cost of 
determining the maximum off-diagonal elements. It also makes 
BCJ especially well-suited for implementation on multiprocessors 
with a hierarchical memory system (e.g., [8]). As well, BCJ is 
suitable for parallel implementation. 

The organization of the paper is as follows. Section 2 gives a 
brief review of serial Jacobi methods for the symmetric eigenproblem. 
Section 3 gives the motivations for BCJ and presents the 
algorithm as implemented in this study. Section 4 lays out the 
numerical experiments with BCJ, including timings and numbers 
of iterations to convergence. Section 5 presents the analysis of the 
block selection method using the theory of order statistics, and 
discusses the implications for the experimental data. Concluding 
remarks and indications for parallel implementations are presented 
in section 6. Section 7 contains the proofs of two probabilistic 
results from section 5. 


2 Review of Serial Jacobi Methods 


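The Jacobi method of solving (1) constructs a sequence of orthogonal 
rotations $U_1 = U(\theta_1, i_1, j_1; A)$, $U_2 = U(\theta_2, i_2, j_2; A^{(1)})$, ..., 
such that $U = U_1 U_2 \cdots$ diagonalizes A (that is, $U^T A U$ is 
diagonal), $0 \le |\theta_\nu| \le \pi/4$, and $\lim_{\nu \to \infty} \theta_\nu = 0$. In practice the 
computation is terminated after a finite number of rotations, leaving 
$U = U_1 U_2 \cdots U_N$. The rotation $U_\nu$ is selected to zero out the 
matrix elements in positions $(i_\nu, j_\nu)$ and $(j_\nu, i_\nu)$. 

Given $(i,j) = (i_\nu, j_\nu)$, the rotation angle $\theta_\nu$ is computed so that 
$A^{(\nu)} = U_\nu^T A^{(\nu-1)} U_\nu$, according to 

$$
\begin{pmatrix} a_{ii}^{(\nu)} & a_{ij}^{(\nu)} \\ a_{ji}^{(\nu)} & a_{jj}^{(\nu)} \end{pmatrix}
=
\begin{pmatrix} c_\nu & s_\nu \\ -s_\nu & c_\nu \end{pmatrix}^{T}
\begin{pmatrix} a_{ii}^{(\nu-1)} & a_{ij}^{(\nu-1)} \\ a_{ji}^{(\nu-1)} & a_{jj}^{(\nu-1)} \end{pmatrix}
\begin{pmatrix} c_\nu & s_\nu \\ -s_\nu & c_\nu \end{pmatrix},
\tag{2}
$$

with $a_{ij}^{(\nu)} = a_{ji}^{(\nu)} = 0$; here $A^{(0)} = A$. The cosine $c_\nu$ and sine $s_\nu$ of 
the angle $\theta_\nu$ may be calculated by [9] 

$$\tau = \frac{a_{jj}^{(\nu-1)} - a_{ii}^{(\nu-1)}}{2\, a_{ij}^{(\nu-1)}}, \qquad a_{ij}^{(\nu-1)} \ne 0, \tag{3}$$

then solving for $t$ in 

$$t = \frac{\operatorname{sign}(\tau)}{|\tau| + \sqrt{1 + \tau^2}} \tag{4}$$

and substituting in 

$$c_\nu = (1 + t^2)^{-1/2}, \qquad s_\nu = t\, c_\nu. \tag{5}$$

$U_\nu$ is set to the identity matrix, except in rows and columns $i_\nu$ and 
$j_\nu$, where it is zero everywhere but in the $2 \times 2$ principal submatrix; 
there it is $\begin{pmatrix} c_\nu & s_\nu \\ -s_\nu & c_\nu \end{pmatrix}$. If $a_{ij}^{(\nu-1)} = 0$ then $c_\nu$ is set to 1 and $s_\nu$ to 
0, for $\theta_\nu$ is obviously 0. 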
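
As a check on equations (3) to (5), the following Python sketch computes 
the cosine and sine for one symmetric 2 x 2 block and applies the rotation 
of equation (2). The function names are illustrative and not from the 
paper; the sketch also assumes that sign(0) is taken as +1, which the text 
does not specify. 

```python
import math


def jacobi_rotation(a_ii, a_ij, a_jj):
    """Return (c, s) from equations (3)-(5); (1, 0) when a_ij is zero."""
    if a_ij == 0.0:
        return 1.0, 0.0
    tau = (a_jj - a_ii) / (2.0 * a_ij)                                       # (3)
    t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))   # (4)
    c = 1.0 / math.sqrt(1.0 + t * t)                                         # (5)
    return c, t * c


def rotate_2x2(a_ii, a_ij, a_jj):
    """Entries of U^T A U for the symmetric block, with U = [[c, s], [-s, c]]."""
    c, s = jacobi_rotation(a_ii, a_ij, a_jj)
    new_ii = c * c * a_ii - 2.0 * c * s * a_ij + s * s * a_jj
    new_jj = s * s * a_ii + 2.0 * c * s * a_ij + c * c * a_jj
    new_ij = c * s * (a_ii - a_jj) + (c * c - s * s) * a_ij
    return new_ii, new_ij, new_jj
```

Choosing the root of smaller magnitude in (4) gives |t| <= 1, which is what 
keeps the rotation angle within pi/4 in absolute value. 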

There are several methods for choosing the rotation index pair 
$(i,j)$. The classical Jacobi method selects $(i,j)$ at each step to 
locally minimize the resulting off-diagonal Frobenius norm by choosing 
$(i,j)$ as the location of the largest off-diagonal element. However, 
the effort of determining the location of the maximum element 
($O(n^2)$ operations) exceeds the work in calculating and applying 
the orthogonal rotation $U_\nu$ (approximately $18n$ operations 
neglecting symmetry). For this reason the method is rarely used 
on computers. 

The cyclic-by-rows ordering of elements ($(i,j) = (1,2), (1,3), 
\ldots, (1,n), (2,3), \ldots, (2,n), \ldots, (n-1,n)$) is more amenable to 
automatic computation. However, the successive index pairs are 
almost always dependent (sharing a row or column), and thus not 
suited for parallel computation. Parallel orderings have featured 
other index pair selections chosen for data locality and utility on 
a systolic processor. The Brent-Luk and Sameh orderings [4,13] 
have many desirable features. They preserve data locality and 
are amenable to systolic or other parallel implementations, they 
converge faster than the cyclic-by-rows ordering, and they rotate 
each off-diagonal element exactly once in a “sweep.” A particularly 
useful feature is that at each step, the $n/2$ independent rotations 
(operating on $n/2$ mutually distinct pairs of rows and columns) 
may be carried out simultaneously. 
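
The dependence between successive cyclic-by-rows pairs is easy to see in a 
small Python sketch. The helper names are illustrative only; indices are 
1-based to match the text. 

```python
def cyclic_by_rows(n):
    """Index pairs (1,2), (1,3), ..., (1,n), (2,3), ..., (n-1,n)."""
    return [(i, j) for i in range(1, n) for j in range(i + 1, n + 1)]


def independent(p, q):
    """True when two index pairs share no row or column index."""
    return not set(p) & set(q)


def count_independent_neighbours(n):
    """Number of consecutive pairs in the ordering that could run in parallel."""
    pairs = cyclic_by_rows(n)
    return sum(independent(p, q) for p, q in zip(pairs, pairs[1:]))
```

For n = 4 the ordering has six pairs, and only one of the five consecutive 
transitions, (1,4) followed by (2,3), joins independent pairs. 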


3 Algorithm BCJ 


We now develop a blocked analogue of the classical Jacobi 
algorithm for the symmetric eigenproblem (1) that performs more 
work in selecting the index pairs yet requires less run-time than 
a blocked Brent-Luk ordering. BCJ also generalizes to computation 
of the singular value decomposition of a rectangular matrix. 
The new Blocked Classical Jacobi (BCJ) method selects the largest 
off-diagonal block(s) for rotation, in order to locally minimize the 
off-diagonal mass of A. Through a suitable choice of the block size, 
the extra computations to determine the off-diagonal block of 
maximum mass are offset by a reduced number of iterations; BCJ is 
more efficient on a serial computer than the other Jacobi methods 
tested. 


BCJ is also highly parallel in nature. Where several processors 
are available to solve a single eigenproblem, the K > 1 largest 
independent off-diagonal blocks may be selected for simultaneous 
rotations, leading to a straightforward parallel implementation. 


At each iteration, BCJ selects an off-diagonal block submatrix 
$(i,j)$ for rotation and computes a block orthogonal rotation matrix 
$U_\nu$, which it then applies to help reduce A to block diagonal form. 
The block orthogonal rotation can be chosen as a sequence of scalar 
Jacobi rotations or from the eigendecomposition of the small block 
matrix; we use a full scalar Sameh sweep on the small block matrix. 
(However, there is no restriction that the small block matrix must 
be diagonalized, only that its off-diagonal mass be reduced. 
Computations by Bischof [1] on the SVD indicate that the extra effort 
of completely diagonalizing the block matrix at each step may be 
wasted.) The method then proceeds by selecting another block 
element of A to rotate. A final processing step of Sameh sweeps forms 
the eigenvalues and eigenvectors from the block diagonal elements 
of A. On an $m \times m$ block matrix, a BCJ “sweep” is $(m^2 - m)/2$ 
two-block by two-block rotations. 


The precise block algorithm for carrying out BCJ to compute 
the symmetric eigendecomposition (1), with D overwriting A, is 
as follows. Assume, for ease of exposition, that the block size $b$ 
divides the matrix size $n$ exactly, so that $n = mb$. $K \ge 1$ independent 
off-diagonal blocks may be selected for simultaneous rotation. 
The iterations continue until a tolerance criterion TOL is met. The 
method begins with $U = I$, $\nu = 0$, and continues 


1. Compute the squared masses $\{M_{ij}\}_{1 \le i < j \le m}$ with 

$$M_{ij} = \sum_{k,l=1}^{b} \left( a^{(\nu)}_{(i-1)b+k,\,(j-1)b+l} \right)^2$$


2. Select K independent rotation pairs $(i_k, j_k)$, $1 \le k \le K$, with 

$$M_{i_k j_k} = \max \{\, M_{ij} \mid i \notin \{i_1, \ldots, i_{k-1}\},\ j \notin \{j_1, \ldots, j_{k-1}\} \,\}$$


3. Compute K block rotations $\{U(\theta_k, i_k, j_k; A^{(\nu)})\}_{k=1}^{K}$ to reduce 
the block off-diagonal mass of A (as indicated below). 


4. Apply the block rotations of step 3 to U and to $A^{(\nu)}$, forming 
$A^{(\nu+1)}$. 


5. If the block off-diagonal mass of $A^{(\nu+1)}$ is not less than TOL 
times the block diagonal mass, then set $\nu := \nu + 1$ and go to 
step 1. 


6. Diagonalize the diagonal blocks of $A^{(\nu)}$ (until the off-diagonal 
mass is less than TOL times the diagonal mass) and update U. 
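
Steps 1 and 2 above can be sketched in Python as follows. This is an 
illustration under stated assumptions, not the implementation timed in 
this study: the matrix is a list of lists, block indices are 0-based, the 
function names are hypothetical, and independence is enforced by a greedy 
pass over the blocks sorted by mass, so that no two chosen blocks share a 
block row or column. 

```python
def block_masses(a, b):
    """Step 1: squared mass of every off-diagonal b-by-b block with i < j."""
    m = len(a) // b
    masses = {}
    for i in range(m):
        for j in range(i + 1, m):
            masses[(i, j)] = sum(
                a[i * b + k][j * b + l] ** 2
                for k in range(b)
                for l in range(b)
            )
    return masses


def select_blocks(masses, k_max):
    """Step 2 (greedy): up to k_max heaviest mutually independent blocks."""
    chosen, used = [], set()
    ranked = sorted(masses.items(), key=lambda item: item[1], reverse=True)
    for (i, j), _mass in ranked:
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
            if len(chosen) == k_max:
                break
    return chosen
```

Sorting the m(m - 1)/2 masses is one simple way to carry out the selection; 
a repeated maximum search over the remaining blocks would serve equally 
well when K is small. 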

Step 3 of our BCJ implementation uses a single scalar Sameh 
sweep to reduce the off-diagonal mass of the two-block by 
two-block submatrix. This sweep includes $b(2b - 1)$ pointwise rotations 
performed sequentially. Step 6 uses successive Sameh sweeps to 
diagonalize the block diagonals of $A^{(\nu)}$. 

The BCJ algorithm is to be compared against the “block 
Brent-Luk” algorithm, which omits step 1 and replaces step 2 by selecting 
$m/2$ block index pairs according to the Brent-Luk ordering. A 
block Brent-Luk sweep also involves $(m^2 - m)/2$ two-block by 
two-block rotations. It is important to note that the two methods under 
comparison differ only in their index selection methods. 


4 Experimental Results 


Several numerical experiments were conducted to compare the 
efficiency of BCJ and blocked Brent-Luk symmetric eigensolvers 
on matrices of random data. The test matrices were generated as 
matrices of uniform random deviates from (0,1]; in each case 10 
tests were run to give non-parametric error bounds to within 10%. 
All computations were carried out with a tolerance of TOL = $10^{-8}$. 
Table 1 summarizes the run times of the two methods on problems 
with various values of n, m, b, and K. (Comparable average Eispack 
times from TRED2/TQL2 are 1.46, 0.26, and 0.05 seconds for n = 
64, 32, 16, respectively.) Figures 1-4 display iteration counts and 
relative efficiencies of the two algorithms. 

These experiments show that the extra work of finding the largest 
independent off-diagonal blocks is offset by faster algorithmic con- 
vergence of BCJ, which makes the present method competitive with 
other blocked Jacobi rotation techniques. While the QR algorithm
is obviously superior on these random, dense matrices, its advan- 
tage will be reduced on nearly diagonal or quite sparse problems. 


              BCJ execution times       |  Blocked Brent-Luk times
  N    B      MIN      AVG      MAX     |    MIN      AVG      MAX
2 26.93 28.68 79.39 136.12 


22.11 25.48 29.98 | 30.02 63.13 146.09 
31.48 39.97 66.47 | 38.19 145.70 424.32 
107.90 253.24 242.68 513.49 


32 
32 
32 


16 

16 
Table 1: Multiflow Trace/7 execution times (sec.) for BCJ and blocked
Brent-Luk, TOL = $10^{-8}$, 10 trials.


5 Algorithmic Analysis


An important factor in determining the efficiency of the algorithm
is the block size. BCJ has the following computational work per
iteration (n = mb):


Step    Computations

1       $2n^2$
2       $O(m^2 \log m)$
3       $6K(2b)(2b - 1)/2$
4       $18nK(2b)(2b - 1)/2$
5       $\approx 2n^2$
6       $2n^2 b + O(nb^2)$


The work for step 1 is actually completed in step 5, where the
block masses are computed, so that after the first iteration step 1
contributes no work. Step 2 can be done in $O(Km^2)$ operations,
which is an improvement if $K = o(\log m)$. Step 6 is performed once
at the end of the calculation and has asymptotically negligible work
if $b = o(n)$; for very large b step 6 dominates the total work.
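To make the relative sizes of these terms concrete, the sketch below evaluates the per-iteration counts for steps 1 to 5 at given n, b and K. The function name `bcj_work` is hypothetical, and two assumptions are made that the text does not state: the hidden constant in the $O(m^2 \log m)$ term is taken as 1, and the logarithm is taken to base 2.

```python
import math


def bcj_work(n, b, K):
    """Leading operation counts per BCJ iteration for steps 1-5."""
    m = n // b
    # (2b)(2b - 1)/2 = b(2b - 1) pointwise rotations in one block rotation.
    rotations = (2 * b) * (2 * b - 1) // 2
    return {
        1: 2 * n**2,
        2: m**2 * math.log2(m),   # assumed constant 1 and base-2 logarithm
        3: 6 * K * rotations,
        4: 18 * n * K * rotations,
        5: 2 * n**2,
    }


work = bcj_work(n=64, b=4, K=8)
for step in sorted(work):
    print(step, work[step])
```

For these sample values the update of step 4 is by far the largest term, which is consistent with the rotation work dominating once b is moderately large.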

A moderate value of b should be preferable in order to maximize
the sum of off-diagonal block masses. Indeed, Figure 1 reflects this
behavior. For small b, the overhead of determining the largest block
exceeds the work of diagonalizing A. As b increases, the maximal
off-diagonal squared block mass will approach the average block
mass, reducing the effect of each block rotation, and consequently
lengthening BCJ computations. Figure 3 shows that for several
matrix sizes n, increasing the block size b increases monotonically
the number of sweeps of BCJ, as expected. Furthermore, the example
of Figure 5 indicates that with relatively few blocks, two
large off-diagonal masses are likely to be dependent.

Figure 2. Average BCJ sweeps. TOL = $10^{-8}$. K = n/2b

We now examine BCJ's behavior with a brief review of relevant
order statistics theory [6,11], which describes the behavior of sorted
random variates in terms of the probability distributions of the
individual elements. Given independent and identically distributed
random variables $X_1, X_2, \ldots, X_N$, the N order statistics
$X_{1,N}, X_{2,N}, \ldots, X_{N,N}$ are the random variables associated
with the lowest ranked to highest ranked $X_i$.


A particular instance of the theory is instructive with regard to
BCJ, which starts with the $(n^2 - n)/2$-sized upper triangular array
from an $n \times n$ symmetric matrix A of uniform random variates.
Let $Y_i$ be one of $(n^2 - n)/2$ iid uniform variates on the interval (0, 1],
and set $X_i = Y_i^2$. Then an average off-diagonal element of A has
squared mass $E[X_i] = 1/3$, while the maximum has mean squared
mass $E[X_{(n^2-n)/2,\,(n^2-n)/2}] = 1 - 2/(n^2 - n + 2)$. Selecting the
maximum off-diagonal element, rather than an average element,
increases the reduction to diagonal form of an individual rotation.

Assuming further that each $X_i$, $1 \le i \le (m^2 - m)/2$, is
distributed as the sum of $b^2$ squares of uniform random values from
the interval (0, 1], as is $M_{ij}$ in the first step of our blocked
experiments, it is clear that the central limit theorem applies to the
block mass distributions. For large b, one may represent the
off-diagonal squared block mass as a normal random variable with
mean $\mu = b^2/3$ and variance $\sigma^2 = 4b^2/45$ (corresponding to the
sum of $b^2$ squared uniform random variables).

Figure 4. Average number of BCJ sweeps. n = 64. TOL = $10^{-8}$
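The mean and variance quoted for the block mass follow from the moments of a single squared uniform variate, $E[Y^2] = 1/3$ and $E[Y^4] = 1/5$, so that $\mathrm{Var}[Y^2] = 1/5 - 1/9 = 4/45$. The short sketch below reproduces this arithmetic exactly with rational numbers; the function name `block_mass_moments` is introduced here only for illustration.

```python
from fractions import Fraction


def block_mass_moments(b):
    """Exact mean and variance of a sum of b*b squared uniform(0, 1] variates."""
    second_moment = Fraction(1, 3)                 # E[Y^2]
    fourth_moment = Fraction(1, 5)                 # E[Y^4]
    var_square = fourth_moment - second_moment**2  # Var[Y^2] = 4/45
    terms = b * b
    return terms * second_moment, terms * var_square


mean, variance = block_mass_moments(4)
print(mean, variance)
```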


The maximal order statistic for these large blocks tends toward 
a standard limiting distribution, from which we may determine the 
moments. Although the example employs sums of uniform vari- 
ates, the proposition holds for any blocks that have asymptotically 
normal squared mass. 


Proposition 1. Let $\{X_i\}_{i=1}^{(m^2-m)/2}$ be iid normal variates with
mean $\mu$ and variance $\sigma^2$. In the limit as $m \to \infty$, the expected
largest variate is

$$E[X_{(m^2-m)/2,\,(m^2-m)/2}] = \mu + \sigma\sqrt{2\log((m^2 - m)/2)} + O\left(\log\log m / \sqrt{\log m}\right) \qquad (6)$$

and the variance is

$$\mathrm{Var}[X_{(m^2-m)/2,\,(m^2-m)/2}] = \frac{\sigma^2\left(\Gamma''(1) - \Gamma'(1)^2\right)}{2\log((m^2 - m)/2)} + O\left(\log\log m / (\log m)^2\right), \qquad (7)$$

where $\Gamma'(\cdot)$ and $\Gamma''(\cdot)$ are the first two derivatives of the gamma
function, respectively. (Note that $\Gamma''(1) - \Gamma'(1)^2 \approx 1.64$.)


Thus the expected largest squared mass is about $2\sqrt{\log m}$ standard
deviations above the mean, with asymptotically vanishing variance.
Cohen [5] has derived similar results for generalized matrix
products. The proof of Proposition 1 is left to section 7.
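As a rough numerical illustration of the leading term of eq. (6), the sketch below evaluates $\mu + \sigma\sqrt{2\log((m^2 - m)/2)}$ for block masses with $\mu = b^2/3$ and $\sigma^2 = 4b^2/45$. It ignores the $O(\log\log m/\sqrt{\log m})$ correction, so it is only indicative for the moderate m used here; the function name and the sample values m = 16, b = 4 are assumptions made for the example.

```python
import math


def expected_max_leading(m, mu, sigma):
    """Leading-order expected maximum of (m^2 - m)/2 iid normal variates."""
    count = (m * m - m) // 2
    return mu + sigma * math.sqrt(2.0 * math.log(count))


b, m = 4, 16
mu = b * b / 3.0
sigma = math.sqrt(4.0 * b * b / 45.0)
largest = expected_max_leading(m, mu, sigma)
print(largest / mu)                      # ratio of largest to average block mass
print((largest - mu) / sigma)            # standard deviations above the mean
print(2.0 * math.sqrt(math.log(m)))      # the 2*sqrt(log m) approximation
```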

BCJ operates by maximizing the mass of the selected off-diagonal
blocks. This works well when the ratio $(\mu + 2\sigma\sqrt{\log m})/\mu$ is large
while the additional cost to determine the largest block is low. Both
cost and benefit decrease with increasing blocksize.

For certain values of b, BCJ inherits the fast convergence of
the classical Jacobi method without paying a large cost for maximal
selection. If b is chosen approximately $b = (\log n)^{1/3}$ and
K = m/2, the work of computing and selecting the independent
maximal blocks is $O(n^2 (\log n)^{1/3})$ per iteration, as is the rotation
work, so that the two are of comparable sizes. For larger block
sizes b, the block selection cost is asymptotically negligible. If b
grows as $\sqrt{\log n}$, then the largest squared block mass is a constant
multiple of the average squared block mass, while the extra cost of
determining the maximal off-diagonal blocks is of smaller order.

Figure 1 clearly indicates the benefits of choosing a moderate
blocksize, as the average solution time initially decreases as b grows.
However, the use of a large b produces longer runtimes.

The selection of K > 1 maximal independent off-diagonal blocks
(step 2) forms a more complex sum of conditional order statistics,
which we now examine. Let $X_i$, $1 \le i \le M_1$, be iid random
variates with density f(x) and distribution F(x). Denote by $X^*_{M_1}$
the maximal order statistic. Now fix a particular subset of size $M_2$
of the remaining variates (excluding the selected maximum and
others), and let $X^*_{M_1,M_2}$ be the maximal variate in the subset. It
is clear that $X^*_{M_1,M_2} \le X^*_{M_1}$. Inductively define $X^*_{M_1,\ldots,M_{k+1}}$ from
$X^*_{M_1,\ldots,M_k}$ as the maximal order statistic in a chosen subset of $M_{k+1}$
variates selected from the previous subset of size $M_k$ (excluding
the previous maximum and others). We call $X^*_{M_1,\ldots,M_k}$ the k-th
conditional maximal order statistic of the $\{X_i\}$.


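Proposition 2. For $M_1 > M_2 > \cdots > M_k > 0$, the probability
distribution of the k-th conditional maximal order statistic is

$$\Pr\{X^*_{M_1,\ldots,M_k} \le x\} = \sum_{i=1}^{k} F(x)^{M_i} \prod_{j=1,\, j \ne i}^{k} \left(\frac{M_j}{M_j - M_i}\right). \qquad (8)$$

Letting $\mu_{M_i} = E[X^*_{M_i}]$ be the unconditional mean of the maximum
on $M_i$ observations, we have

$$E[X^*_{M_1,\ldots,M_k}] = \sum_{i=1}^{k} \mu_{M_i} \prod_{j=1,\, j \ne i}^{k} \left(\frac{M_j}{M_j - M_i}\right). \qquad (9)$$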
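The mean in eq. (9) is a weighted sum of unconditional means of maxima, with the weight of the i-th term being the product of $M_j/(M_j - M_i)$ over $j \ne i$. The sketch below evaluates that sum exactly for the special case of uniform variates on (0, 1), for which the maximum of M observations has mean M/(M + 1). The function names are hypothetical and the uniform case is chosen only because its means are simple rationals.

```python
from fractions import Fraction


def conditional_max_mean(sizes, mean_of_max):
    """Evaluate eq. (9) for strictly decreasing subset sizes M_1 > ... > M_k."""
    total = Fraction(0)
    for i, m_i in enumerate(sizes):
        weight = Fraction(1)
        for j, m_j in enumerate(sizes):
            if j != i:
                weight *= Fraction(m_j, m_j - m_i)
        total += weight * mean_of_max(m_i)
    return total


def uniform_max_mean(count):
    """Mean of the maximum of `count` iid uniform(0, 1) variates."""
    return Fraction(count, count + 1)


# Sizes 15, 6, 1 correspond to three selection stages in a 6 by 6 array.
for k in (1, 2, 3):
    print(k, conditional_max_mean([15, 6, 1][:k], uniform_max_mean))
```

A quick consistency check: with sizes (2, 1) the second statistic is the smaller of two uniform variates, whose mean is 1/3, and with sizes (3, 2, 1) the third statistic is the minimum of three, whose mean is 1/4; the formula returns both values.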

We briefly indicate the formulation of the first step of BCJ in
terms of Proposition 2. In BCJ, K maximal off-diagonal blocks are
selected in K stages from an $m \times m$ upper triangular array of
$(m^2 - m)/2$ iid random variates. Independence of the selected locations
requires striking out the row and column of the maximum. At stage
i, $1 \le i \le K$, the maximal variate will be drawn from a subset of
$\binom{m-2i+2}{2}$ blocks in the strict upper triangle of the array and then
two rows and columns of the array will be struck out, corresponding
to the row and column indices of the selected maximal element.

For instance, in the 6 by 6 example of Figure 5, the first maximum
is 10 (row 1, column 4). Thereafter rows and columns 1
and 4 are struck from the array (to preserve independence) and
the second conditional maximum is selected; it is 5 (row 2, column
3). Note that larger elements that are dependent upon the first
maximum may be ignored in the selection of the second maximum.
Finally, rows and columns 2 and 3 are struck from the array and
the final maximum of 4 (row 5, column 6) is selected.
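The staged selection can be sketched as a greedy scan that costs $O(Km^2)$ comparisons. The array below is hypothetical: the entries of Figure 5 are not reproduced in the text, so the values were chosen only to agree with the three maxima quoted above (10, 5 and 4) and to include larger dependent entries that must be skipped. The function name is likewise an assumption.

```python
def select_independent_maxima(masses, count):
    """Greedily pick up to `count` pairs (i, j) that share no row or column index."""
    used = set()
    chosen = []
    for _ in range(count):
        candidates = [
            (value, pair)
            for pair, value in masses.items()
            if pair[0] not in used and pair[1] not in used
        ]
        if not candidates:
            break
        value, pair = max(candidates)
        chosen.append((pair, value))
        used.update(pair)   # strike out both indices of the selected pair
    return chosen


# Hypothetical strict upper triangle of a 6 by 6 array, 1-based indices.
masses = {
    (1, 2): 9, (1, 3): 6, (1, 4): 10, (1, 5): 1, (1, 6): 2,
    (2, 3): 5, (2, 4): 3, (2, 5): 3, (2, 6): 2,
    (3, 4): 7, (3, 5): 1, (3, 6): 2,
    (4, 5): 2, (4, 6): 8,
    (5, 6): 4,
}
print(select_independent_maxima(masses, 3))
```

The entries 9, 8 and 7 are larger than the second selected maximum but share an index with the pair (1, 4), so they are ignored, as described in the text.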


The selection of the K maximal independent off-diagonal blocks
(which forms the more complex sum of conditional order statistics
discussed above) determines on average a smaller sum of
off-diagonal masses than K successive iterations choosing the single
largest block. However, it is observed in Figure 4 that the number
of sweeps to convergence initially declines as K increases. This
probably reflects the amortization of step 5 costs over additional
blocks. As expected, BCJ requires slightly more iterations to
converge as K reaches its upper limit of m/2 (e.g., b = 2, 4).

Figure 5: Conditional maximum selection: X(1,4) = $X^*_{15}$ = 10;
X(2,3) = $X^*_{15,6}$ = 5; X(5,6) = $X^*_{15,6,1}$ = 4.

Figure 2 presents in graphical form the ratios of the average BCJ 
and blocked Brent-Luk execution times on a Multiflow Trace/7 
computer. The efficiency ratio shows the speedup of BCJ, with 
improvements up to a factor of 3.6 due entirely to improved index 
selection. Examination of Table 1 shows that, for almost all cases, 
BCJ runtimes have lower deviations from the mean. 

Asymptotically, $b = \Omega((\log n)^{1/3})$ guarantees that the work of
selecting the maximal blocks will be at most comparable to the
other arithmetic operations. However, assuming normality of initial
data and intermediate results, the optimal b so that largest
blocks are substantially larger than average ($b^2 = O(b\sqrt{\log m^2})$,
from eq. (6)) is $b = O(\sqrt{\log n})$. For $b \approx (\log n)^{\alpha}$, $1/3 \le \alpha \le 1/2$,
BCJ should be asymptotically faster than a blocked Brent-Luk
method. The numerical experiments show speedups for problems
of moderate size.

In general, the distribution of the elements of $A^{(\nu)}$ will be more
complex than described here and the order statistic argument must 
be specialized to include the distributions of intermediate results, 
in order to rigorously prove rates of convergence. However, the 
improved performance of the new algorithm is consonant with the 
analysis performed here. 

In cases where the matrix has few large elements or is close to
diagonal, one expects BCJ to achieve shorter runtimes than indicated
by these experiments on uniform random data. For instance, the 
method may prove useful in adaptive signal processing algorithms 
that rely on eigenvalue decompositions [14]. 


6 Conclusions 


The improved index selection process of BCJ produces a substan- 
tial overall reduction in the program running time, compared to a 
blocked Brent-Luk algorithm. In particular, the extra work of de- 
termining the largest off-diagonal blocks is offset by fewer iterations 
needed for convergence. Furthermore, because the algorithm em- 
ploys blocked data concepts, it is appropriate for computers with 
a hierarchical memory system. The concentration of work on the 
relatively small and numerous block elements is advantageous for 
parallelization of the algorithm. 

The selection of parameters b and K is important to the effi- 
ciency of BCJ. A moderate value of b gives the lowest run times 
(though not the lowest number of block sweeps). The extra benefit 
of increasing K falls off rapidly for small n. 

Nearly all stages of the algorithm are amenable to efficient parallel
computation. Step 1 can be computed independently on $m^2$
processors; step 2 on various combinations of processors and
interconnections; step 3 on K large-grained or Kb fine-grained
processors, depending on whether the block rotation is parallelized or not;
step 4 on up to bKn processors; step 5 on $m^2$ or more processors;
and step 6 on K or more processors.

This investigation of BCJ was prompted by the use of a blocked 
Brent-Luk method in the Saxpy Computer Corp. mathematical 
subroutine library. It appears that a BCJ method could be more 
efficient than the current approach. Although Eispack routines 
are obviously quite fast for the dense examples used here, BCJ 
may improve upon the QR algorithm in cases of sufficiently sparse 


symmetric eigenvalue problems. 


7 Calculation of Distributions of the Maximal
and k-th Conditional Maximal Order Statistics


Proof of Proposition 1. David [6] presents the limiting distributional
behavior of the maximal order statistic $X_{n,n}$, which depends
upon the well-known distribution

$$\Lambda(x) = \exp(-e^{-x}), \qquad -\infty < x < \infty \qquad (10)$$

in the case of iid 0-1 normal variates. We now carry through the
analysis for general iid normal variates.

Let $\phi_{\mu,\sigma^2}(x) = (\sigma\sqrt{2\pi})^{-1} \exp(-(x - \mu)^2/2\sigma^2)$ be the normal
density function and let $\Phi_{\mu,\sigma^2}(x) = \int_{-\infty}^{x} \phi_{\mu,\sigma^2}(y)\,dy$ be the normal
distribution function corresponding to mean $\mu$ and variance $\sigma^2$,
respectively. For large x,

$$\frac{1 - \Phi_{\mu,\sigma^2}(x)}{\phi_{\mu,\sigma^2}(x)} \sim \frac{\sigma^2}{x - \mu} \qquad (11)$$

based on a change of variables from the case $\mu = 0$ and $\sigma = 1$.
Thus Theorem 9.3.5 [6] applies and


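$$\lim_{n \to \infty} \Pr\{(X_{n,n} - l_n)\, n\, \phi_{\mu,\sigma^2}(l_n) \le x\} = \Lambda(x) \qquad (12)$$

holds uniformly for every $x \in (-\infty, \infty)$, where $l_n$ is selected so that
$\Phi_{\mu,\sigma^2}(l_n) = 1 - 1/n$.
According to (11),

$$\frac{1}{n} = 1 - \Phi_{\mu,\sigma^2}(l_n) \sim \frac{\sigma}{(l_n - \mu)\sqrt{2\pi}} \exp\left(-\frac{(l_n - \mu)^2}{2\sigma^2}\right). \qquad (13)$$

The asymptotic form of $l_n$ is then

$$l_n = \mu + \sigma\left(\sqrt{2\log n} - \frac{\log\log n + \log 4\pi}{2\sqrt{2\log n}}\right) + O(1/\log n). \qquad (14)$$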


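Using the relation $n\phi_{\mu,\sigma^2}(l_n) = (l_n - \mu)/\sigma^2 + O(l_n^{-1})$, we see that

$$\lim_{n \to \infty} \Pr\{(X_{n,n} - l_n)(l_n - \mu)/\sigma^2 \le x\} = \Lambda(x). \qquad (15)$$

It follows directly from (15) that

$$E[X_{n,n}] = l_n - \frac{\Gamma'(1)\,\sigma^2}{l_n - \mu} \qquad (16)$$

$$= \mu + \sigma\sqrt{2\log n} + O\left(\log\log n / \sqrt{\log n}\right), \qquad (17)$$

where $-\Gamma'(1)$ (Euler's constant) is the mean associated with $\Lambda(x)$.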
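The variance vanishes asymptotically according to

$$\mathrm{Var}[X_{n,n}] = \frac{\sigma^4\left(\Gamma''(1) - \Gamma'(1)^2\right)}{(l_n - \mu)^2} \qquad (18)$$

$$= \frac{\sigma^2\left(\Gamma''(1) - \Gamma'(1)^2\right)}{2\log n} + O\left(\log\log n / (\log n)^2\right), \qquad (19)$$

where $\Gamma''(1) - \Gamma'(1)^2$ is the variance associated with $\Lambda(x)$. (Here
we have used $\Gamma'(\cdot)$ and $\Gamma''(\cdot)$ to represent the first two derivatives
of the gamma function, respectively.) $\Box$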


Proof of Proposition 2. Let $\{X^*_{M_k} = x \mid X_i \le y\}$ denote the
event that the maximal order statistic on $M_k$ observations is x,
conditioned on all observations $X_i$ in the subset of size $M_k$ being
bounded above by y. Then the density of the k-th conditional
maximal order statistic obeys the relation

$$\Pr\{X^*_{M_1,\ldots,M_k} = x\} = \int_{x}^{\infty} \Pr\{X^*_{M_1,\ldots,M_{k-1}} = y\}\, \Pr\{X^*_{M_k} = x \mid X_i \le y\}\, dy, \qquad (20)$$

where

$$\Pr\{X^*_{M_k} = x \mid X_i \le y\} = \begin{cases} \dfrac{d}{dx}\left(\dfrac{F(x)}{F(y)}\right)^{M_k}, & -\infty < x \le y < \infty \\ 0, & -\infty < y < x < \infty \end{cases} \qquad (21)$$

is the probability distribution of the maximal order statistic on $M_k$
bounded observations.

Define $V_k(x) = \Pr\{X^*_{M_1,\ldots,M_k} \le x\}$. Then $V_1(x) = F(x)^{M_1}$.
Inductively assuming that $V_k(x) = \sum_{i=1}^{k} a_{i,k} F(x)^{M_i}$ gives a
recurrence relation on the $a_{i,k}$ of

$$a_{i,k} = a_{i,k-1}\, \frac{M_k}{M_k - M_i}, \qquad 1 \le i < k, \qquad (22)$$

and

$$a_{k,k} = \sum_{i=1}^{k-1} a_{i,k-1}\, \frac{M_i}{M_i - M_k}, \qquad (23)$$

where $a_{1,1} = 1$. Consideration of the k - 1 degree Lagrangian
polynomial interpolating the points $(M_i, 1)$, $1 \le i \le k$, establishes
that

$$a_{i,k} = \prod_{j=1,\, j \ne i}^{k} \left(\frac{M_j}{M_j - M_i}\right). \qquad (24)$$

The distribution is thus

$$\Pr\{X^*_{M_1,\ldots,M_k} \le x\} = \sum_{i=1}^{k} F(x)^{M_i} \prod_{j=1,\, j \ne i}^{k} \left(\frac{M_j}{M_j - M_i}\right). \qquad \Box \qquad (25)$$


Abstract


A new parallel Jacobi-like solution method for the 
singular value decomposition (SVD) is presented which 
is optimal in achieving both the maximum concurrency 
in computation and the minimum overhead in commu- 
nication. Unlike previously published parallel SVD al- 
gorithms based on a nearest neighbour ring topology for 
communication, the new algorithm introduces a recur- 
sive divide-exchange communication pattern. As a re- 
sult of the recursive nature of the algorithm, proofs are 
given to show that it achieves the lower bounds both in 
computation and communication costs. In general, the 
recursive pairwise exchange communication operations 
of the new algorithm can be efficiently supported by 
multiprocessors with interconnect patterns used in many 
networks that have been proposed to support large-scale 
parallelism. As an example, this paper illustrates that 
the new algorithm can be mapped efficiently and natu- 
rally onto hypercube architectures. Preliminary results 
with an implementation of the new algorithm are re- 
ported. Convergence aspects of the new algorithm are 
briefly discussed. A comparison with related work is 
outlined. 


1 Introduction 


Rapid technological advances in multiprocessor architectures 
have aroused much interest in parallel computation. Parallel 
methods to compute the singular value decomposition (SVD) 
have received attention due to its many important applications 
in science and engineering. A recent paper by Heath et al. [8]
includes a history of various Jacobi-like SVD algorithms.

An early investigation into parallel computation for the
symmetric eigenvalue problem on the SIMD Illiac IV is described
by Sameh in [18]. Sameh outlines the criteria for maximal
parallelism in a Jacobi-like algorithm. More recently, a
number of authors including Berry et al. [1] advocate the one-sided
SVD of Hestenes [9], [8], [15] for parallel computation
of the SVD. Luk and his co-workers have examined various
systolic array configurations to compute the SVD [12], [3], [4].
Brent and Luk [4] have invented a linear array of n/2 processors
which implements a one-sided Hestenes algorithm that,
in real arithmetic, is an exact analogue of their Jacobi method
applied to the eigenvalue problem. The array requires O(mnS)
time, where S is the number of sweeps (typically < 10). Brent
and Luk demonstrate that their algorithm is computationally
optimal in the sense that it requires the minimum number of
computational steps per sweep, i.e. n - 1, to ensure the
execution of every possible pairwise column rotation. Maximum
concurrency is maintained throughout the computation. Their
systolic array is comparable to the architecture of a
nearest-neighbour linear array of processors, where communication is
based on a ring topology.

Brent and Luk's algorithm is not optimal in terms of
communication overhead. Unnecessary costs are incurred by mapping
the systolic array architecture onto a ring connected linear
array due to the double sends and receives required between
pairs of neighbouring processors. Eberlein [5], Bischof [2] and
others have proposed various modifications for hypercube
implementations, which require the embedding of rings via binary
reflected Gray codes.

In this paper, we present a new parallel Jacobi-like solution
method for the SVD which is optimal in achieving both
the maximum concurrency in computation and the minimum
overhead in communication. Unlike previously published parallel
SVD algorithms based on a nearest neighbour ring topology
for communication, the new algorithm proposed in this paper
introduces a recursive divide-exchange communication pattern.
As a result of the recursive nature of the algorithm, proofs
are given to show that it achieves the lower bounds both in
computation and communication costs. Convergence aspects
of the new algorithm are briefly discussed. The paper illustrates
that the new algorithm can be mapped efficiently and
naturally onto hypercube architectures. We have implemented
the new algorithm on the Intel hypercube through simulation
and the preliminary results will be discussed. A comparison
with related work is briefly outlined. We believe that the new
algorithm can be efficiently mapped onto multiprocessors with
interconnection patterns that have been proposed to support
large-scale parallelism such as the many PM2I-based or
cube-based networks [20].


2 Jacobi-like Algorithms 


2.1 The Singular Value Decomposition 


The singular value decomposition (SVD) of a general non-square
matrix may be given as follows,

Theorem 2.1 For a real matrix A ($m \times n$) of rank r, there
exist orthogonal matrices U ($m \times m$) and V ($n \times n$), such that

$$U^T A V = \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_q) \ge 0,$$

where the elements of $\Sigma$ ($m \times n$) may be ordered so that

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_q = 0, \qquad q = \min\{m, n\}.$$

If m = n, $\Sigma$ is a square diagonal $n \times n$ matrix [11].

In order to compute the SVD in an iterative fashion, a series
of plane rotations may be applied to the matrix A ($m \times n$)
described in theorem 2.1 above. This approach is similar in
nature to Jacobi's original method for computing the eigenvalues
of a symmetric matrix where orthogonal matrices $J(i, j, \theta)$
are applied so as to annihilate a symmetrically placed pair of
the n(n - 1) off-diagonal elements. These rotation matrices
differ from the identity matrix of order n by the principal
submatrix formed at the intersection of the row and column pairs
corresponding to i and j. A $2 \times 2$ submatrix has the form

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix}.$$

The cosine and sine of the rotation angle $\theta$ are the constants
$c = \cos\theta$ and $s = \sin\theta$. Initially $A_1 = A$ and at the k-th
iteration,

$$A_{k+1} = J(i_k, j_k, \theta_k)^T A_k J(i_k, j_k, \theta_k).$$

Rotations are applied simultaneously, in a symmetric fashion
from the left and right. Cyclic Jacobi methods refer to a
sequence of rotations which update row and column pairs in some
predetermined order. For a square matrix, a cyclic sweep refers
to the updating of n(n - 1)/2 elements. A number of sweeps are
required in order to effectively reduce the off-diagonal mass of
the matrix to a sufficiently small value, which eventually can be
ignored. A diagonal containing the eigenvalues then remains.


Annihilation of 2 off-diagonal elements of a symmetric matrix
takes the form,

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T \begin{bmatrix} a_{ii}^{(k)} & a_{ij}^{(k)} \\ a_{ji}^{(k)} & a_{jj}^{(k)} \end{bmatrix} \begin{bmatrix} c & s \\ -s & c \end{bmatrix} = \begin{bmatrix} a_{ii}^{(k+1)} & 0 \\ 0 & a_{jj}^{(k+1)} \end{bmatrix}.$$

Kogbetliantz appears to have been the first to apply this method
to general nonsymmetric matrices [10] (see [8] and [7]). We can
generalize the above equation to the computation of a $2 \times 2$
SVD, by using two different orthogonal rotation matrices [8].
A serial-cyclic sweep of a general $m \times n$ matrix A can be
performed either by a cyclic-by-row or a cyclic-by-column scheme.

As noted by Brent et al. [4] and others, serial cyclic-by-row
and cyclic-by-column schemes are not suitable for parallel
computation due to column and row conflicts throughout. In §2.2
we shall indicate that orderings suitable for parallel computation
would apply $\lfloor n/2 \rfloor$ rotations simultaneously. In terms of
convergence for algorithms which compute the SVD in a cyclic
manner we may appeal to the results of Paige and Van Dooren
[16].


2.2 Exploiting Parallelism 


Sameh was one of the first researchers to observe that there
is a bound on the number of rotations which may be applied
in parallel [18], [1], [19]. Given a general $m \times n$ matrix, a
Kogbetliantz cyclic sweep consists of a maximum of

$$N = \frac{\max\{m, n\}\,(\max\{m, n\} - 1)}{2}$$

pairs of rotations. Our goal is to complete a sweep in the minimum
number of parallel steps each consisting of the maximum
number of rotations applied in parallel. In addition the maximum
number of processors should be kept busy at all times.

Criteria such as these were originally formulated by Sameh
[18]. For square matrices with n(n - 1)/2 elements above
the main diagonal, it is possible to update or annihilate $\lfloor n/2 \rfloor$
elements at a time. Defining $r = \lfloor (n + 1)/2 \rfloor$, we can have
(2r - 1) rotation sets applied per sweep. To summarize,

1. An orthogonal rotation set must annihilate or update
$\lfloor n/2 \rfloor$ elements.

2. A sweep should annihilate each off-diagonal element only
once. This implies each of (2r - 1) orthogonal rotation
sets should annihilate or update $\lfloor n/2 \rfloor$ elements.


The size of a rotation set is simply determined by the maximum
number of non-conflicting column pairings possible. For
example, given an $n \times n$ square matrix with n = 4 we may
simultaneously apply 2 rotations from the left or right. This is
equivalent to multiplication by an orthogonal matrix V of the
form,


$c_1 \quad s_1$

The number of parallel iterations in a computation is therefore
bounded below by

$$\frac{n(n - 1)}{2} \times \frac{1}{\lfloor n/2 \rfloor} \qquad (2.1)$$

or equivalently,

$$\begin{cases} n & n \text{ odd} \\ n - 1 & n \text{ even.} \end{cases}$$

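Proposition 2.2 For n a positive integer, if $r = \lfloor (n + 1)/2 \rfloor$
then

$$\frac{n(n - 1)}{2} \times \frac{1}{\lfloor n/2 \rfloor} = 2r - 1 = \begin{cases} n & n \text{ odd} \\ n - 1 & n \text{ even} \end{cases} \qquad (2.2)$$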

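Proof. Consider two cases,

Case 1. When n is even, $\lfloor n/2 \rfloor = n/2$ so that,

$$\frac{n(n - 1)}{2} \times \frac{1}{\lfloor n/2 \rfloor} = n - 1.$$

Furthermore since n is even, n + 1 is odd, hence
$\lfloor (n + 1)/2 \rfloor = \lfloor n/2 \rfloor = n/2$ and

$$2r - 1 = 2\left(\frac{n}{2}\right) - 1 = n - 1.$$

Case 2. When n is odd, $\lfloor n/2 \rfloor = \lfloor (n - 1)/2 \rfloor = (n - 1)/2$ and

$$\frac{n(n - 1)}{2} \times \frac{1}{\lfloor n/2 \rfloor} = n.$$

With n odd, n + 1 is even, so that $\lfloor (n + 1)/2 \rfloor = (n + 1)/2$
and

$$2r - 1 = 2\left(\frac{n + 1}{2}\right) - 1 = n.$$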

If we assume that n is even, then not all algorithms de- 
scribed in the literature have achieved the n — 1 lower bound. 
Sameh’s implementation of Hestenes’ one-sided computation 
on a linear array of processors requires 3n — 2 parallel itera- 
tions per sweep [19], whereas Brent and Luk report that they 
achieve the minimum with their systolic array [4]. 


2.3 A One-sided Computation 


When we consider general non-square $m \times n$ matrices where
m > n there exists a convenient computation for the SVD 
which is appropriate for parallel implementation. This method 
is based on a one-sided computation originally due to Hestenes 
[9]. It is referred to as one-sided because orthogonal rotations 
are only applied from the right, updating columns. Brent and 
Luk’s [4] systolic array implements Hestenes’ algorithm. Basic 
operations in each processor of their array reflect a tournament 
ordering scheme for rotations performed in parallel. The per- 
formance of their scheme is analyzed in §3. Eberlein [5] has pro- 
posed a block variant of Hestenes’ algorithm on a hypercube, 
suitable for computing either singular values or eigenvalues of 
symmetric matrices. 

Hestenes' one-sided computation produces an orthogonal
matrix V and a matrix Q with orthogonal columns such that

$$AV = Q = [q_1, q_2, \ldots, q_n], \qquad (2.3)$$

where A is $m \times n$, m > n. The Euclidean norms of the columns
will be equated with the singular values of A,

$$q_i^T q_j = \sigma_i^2 \delta_{ij}, \qquad i, j = 1, \ldots, n.$$

By normalizing the columns, we see that the SVD of theorem
2.1 is implicit in (2.3),

$$Q = U\Sigma, \qquad A = U\Sigma V^T.$$


A one-sided algorithm is somewhat different from its earlier
counterparts, as rotations are applied from the right and therefore
only columns are affected. Off-diagonal elements are no
longer annihilated; instead rotations are designed in order to
produce two orthogonal columns. As with similar Jacobi-like
algorithms, the orthogonal matrix V may be accumulated from
plane rotations $J(i, j, \theta)$ which differ from the unit matrix $I_n$
in a $2 \times 2$ principal submatrix containing the cosines and sines
of the rotation. Setting $A_1 = A$, the k-th iteration updates $A_k$,

$$A_{k+1} = A_k J(i, j, \theta_k).$$

If the matrix sequence $A_k$ converges, the result is Q in (2.3).
A column update via a $2 \times 2$ submatrix takes the form

$$\begin{bmatrix} a_i^{(k+1)} & a_j^{(k+1)} \end{bmatrix} = \begin{bmatrix} a_i^{(k)} & a_j^{(k)} \end{bmatrix} \begin{bmatrix} c & s \\ -s & c \end{bmatrix}.$$


The orthogonality condition determines the rotation angle $\theta$.

(2.4)

To avoid a potential loss of significant digits, the magnitude
of the angle may be restricted to $|\theta| \le \pi/4$; formulae for the
rotation are given by Nash [15] and Rutishauser [17].
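A minimal sketch of one such column rotation is given below. The formulae of Nash and Rutishauser are not reproduced in the text, so the code assumes the commonly used choice in which the tangent t is the smaller root of $t^2 + 2\zeta t - 1 = 0$ with $\zeta = (\beta - \alpha)/(2\gamma)$, where $\alpha$, $\beta$ are the squared column norms and $\gamma$ is the inner product of the two columns; this keeps $|\theta| \le \pi/4$. The function name and test vectors are hypothetical.

```python
import math


def orthogonalize_pair(x, y):
    """Rotate columns x, y by [[c, s], [-s, c]] from the right so they become orthogonal."""
    alpha = sum(u * u for u in x)
    beta = sum(v * v for v in y)
    gamma = sum(u * v for u, v in zip(x, y))
    if gamma == 0.0:
        return list(x), list(y), 1.0, 0.0     # already orthogonal
    zeta = (beta - alpha) / (2.0 * gamma)
    # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0, so |t| <= 1.
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    new_x = [c * u - s * v for u, v in zip(x, y)]
    new_y = [s * u + c * v for u, v in zip(x, y)]
    return new_x, new_y, c, s


x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.5]
new_x, new_y, c, s = orthogonalize_pair(x, y)
print(sum(u * v for u, v in zip(new_x, new_y)))   # inner product after rotation
```

Because the rotation is orthogonal, the sum of the squared norms of the two columns is unchanged; only the inner product is driven to zero.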

As noted by Brent and Luk [4], if a cyclic-by-row rotation
ordering is chosen to update the n(n - 1)/2 column pairings
determined by the off-diagonal elements above the main diagonal,
convergence would follow. Hestenes' computation is
mathematically equivalent to a Jacobi algorithm applied to $A^T A$,
therefore we expect that the convergence analyses of Forsythe
and Henrici [6] or Wilkinson [22] are applicable under these
circumstances. Rather than testing for convergence, the threshold
Jacobi method originally introduced in the symmetric
eigenproblem is often employed [23, pp. 277-278], [17].


3 Parallel Computation 


3.1 Maximizing Concurrency 


In this paper the computation cost is measured by the number
of parallel computation steps. The methods discussed process
(i, j) pairings consisting of partitions containing at least 1
column or row. When n is even, if we assume one parallel
computation step has unit cost, then of the algorithms presented
the minimum cost achieved is n - 1 per sweep. The systolic array
and associated algorithm proposed by Brent and Luk were
proven to achieve this lower bound in [4]. We have illustrated
their basic scheme in figure 1 for the case n = 8, where a linear
array of four processors $\{P_1, P_2, P_3, P_4\}$ is used.


3.2 Minimizing Communication Costs 


Another important performance criterion for a parallel algorithm
is the total communication cost. For our purposes the commu- 
nication cost can be measured by the total number of inter- 
processor transactions (messages). A transaction consists of a 
column transmission between a pair of processors. The total 
communication cost of one sweep will be denoted C. 

From the last section, we know that the minimum number
of computation steps in a sweep is K = n - 1. The minimum
number of interprocessor transactions is achieved when each
processor retains one column from a pairing, and transmits the
other to a destination processor. As a result, if there are p
processors, p transmissions are performed between two consecutive
steps. Hence the minimum total communication cost $C_{min}$ is
defined by the following.

$$C_{min} = (K - 1)p. \qquad (3.1)$$

In the parallel one-sided SVD algorithm each processor is
assigned one of n/2 column pairs at each step, assuming n is even.
The total number of processors required is p = n/2 in (3.1) and
the communication costs are $O(n^2)$.

$$C_{min} = (K - 1)p = \frac{n(n - 2)}{2}.$$

Figure 1: Brent and Luk's Systolic Array


As a contrast, a global broadcasting strategy may request
each processor to send both columns to all other p - 1 processors
between each step. The total cost for this case will be $O(n^3)$.
Brent and Luk's algorithm has the following communication
cost.

$$C_{BL} = (K - 1) \times 2p = (n - 2) \times 2\left(\frac{n}{2}\right) = n(n - 2).$$

Therefore their algorithm is close to, but not quite, optimal.
In fact the inefficiency lies in the double sends and receives
between processors in the systolic array which are dictated by
the tournament ordering.
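The two per-sweep message counts can be compared directly. The sketch below evaluates $(K - 1)p$ and $(K - 1) \times 2p$ with K = n - 1 and p = n/2 for even n; the function name is hypothetical.

```python
def sweep_communication_costs(n):
    """Return (C_min, C_BL): messages per sweep for n columns on n/2 processors."""
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    steps = n - 1                    # K, parallel computation steps per sweep
    p = n // 2                       # processors
    c_min = (steps - 1) * p          # one column sent per processor per exchange
    c_bl = (steps - 1) * 2 * p       # double sends of the tournament ordering
    return c_min, c_bl


for n in (4, 8, 16, 32):
    print(n, sweep_communication_costs(n))
```

For every even n the second count is exactly twice the first, which is the factor the double sends and receives introduce.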

Several ways of modifying Brent and Luk's algorithm to
avoid the double sends and receives have been proposed [13],
[14], [5], [2]. These algorithms all represent a communication
regimen based on a ring topology. A ring topology resembles
the architecture of a linear array of processors. Embedding a
ring within another topology, for example the binary n-cube,
requires a special mapping scheme.


4 An Optimal Parallel SVD Algorithm


In this section we present a new parallel Jacobi-like algorithm
which is optimal in terms of both achieving maximum concurrency
and minimum communication overhead. The algorithm
relies on a recursive divide-exchange of $n = 2^d$ columns.

Unlike several orderings cited earlier, the new algorithm
maps naturally onto parallel architectures which support
recursive pairwise exchanges. A mapping onto a hypercube is
presented as an example in §5. Pairwise exchanges of columns
here are specified by a Perfect Shuffle of processor addresses
[21].


4.1 The Parallel Algorithm 


Let us first illustrate the basic principle of the new algorithm
through an example where n = 8 and p = 4. The computation
steps $K_i$ and communication steps $X_i$, consisting of pairwise
exchanges, are shown in figure 2.


Figure 2: Recursive Divide-Exchange


Initially the 8 column indices are divided into two sets.

$$G_1 = \{1, 3, 5, 7\}, \qquad G_2 = \{2, 4, 6, 8\}. \qquad (4.1)$$

The pairs $\{(1,2), (3,4), (5,6), (7,8)\} \in G_1 \times G_2$ are assigned,
in order, to processors in the set $P = \{P_1, P_2, P_3, P_4\}$.
The algorithm for $n = 2^d = 2^3 = 8$ consists of three parts:

Part 1: Compute-Exchange stage. The first stage consists of
n/2 = 4 computation steps {K_1, K_2, K_3, K_4} and n/2 - 1 = 3
communication (exchange) steps {X_1, X_2, X_3}. In
one computation step, each processor performs a plane
rotation on an (i,j) pairing. A communication step X_i
exchanges columns with indices in G_2 between processor
pairs.


Part 2: Divide step. Processors are divided into two sets

\mathcal{P}_1 = \{P_1, P_2\}, \mathcal{P}_2 = \{P_3, P_4\}.

The column indices in G_1 are divided into two subsets,

G_3 = \{1, 3\}, G_4 = \{5, 7\},

and are assigned to \mathcal{P}_1. Similarly, G_2 is split into

G_5 = \{2, 4\}, G_6 = \{6, 8\},

and assigned to \mathcal{P}_2, as indicated in figure 2 by step D_1.


Part 3: Recursively solve the two subproblems using a scheme
similar to parts 1 and 2. A subproblem consists of n' =
n/2 = 4 columns and n'/2 = 2 processors.


In order to specify the pairwise exchange of columns between
processors described in part 1 above we introduce the
notion of distance between processors. Given an index set
S = {1,2,...,N} synonymous with processor addresses and
a set of processors P = {P_i : i ∈ S} we have,


Definition 4.1 The distance s ∈ S between processors P_{i_1} ∈ P
and P_{i_2} ∈ P is defined to be


s = |i_1 - i_2|
The algorithm (for n = 2^d, d = 3) can be unwound into a
sequence of d = 3 compute-exchange stages (with one divide
step between each pair of successive compute-exchange stages)
as shown in figure 2.


{K_1, X_1, K_2, X_2, K_3, X_3, K_4, D_1, K_5, X_5, K_6, D_2, K_7}.


Each exchange step X_i is a parallel pairwise exchange of
columns with indices in G_2 between processor pairs (P_{i_1}, P_{i_2}), where
P_{i_1} and P_{i_2} are at a distance 2^h (h ≥ 0, h an integer and
i_1 < i_2). Furthermore, the binary representations of i_1 and
i_2 may only differ in bit position h. For example, the three
communication steps X_1, X_2 and X_3 result in the exchange
pairings illustrated in figure 3.


Figure 3: Parallel Processor Pairings


In general, the algorithm (for n = 2^d) can be unwound into
a sequence of d compute-exchange stages (with one divide step
between each pair of successive compute-exchange stages). If
we number the d compute-exchange stages by k, k = 1,...,d,
the k-th compute-exchange stage consists of 2^{d-k} computation
steps K_l, l = 1,...,2^{d-k}, and 2^{d-k} - 1 communication
(exchange) steps X_l, l = 1,...,2^{d-k} - 1, forming


{K_1, X_1, K_2, X_2, ..., X_{2^{d-k}-1}, K_{2^{d-k}}}.


1. At each computation step K_l, processors concurrently
compute rotations on their assigned column pairings.


2. At each communication step X_l a parallel pairwise exchange
of columns with indices in G_2 is performed between
processor pairs at a distance 2^h, where h is given
by the function


h = h(l) = \begin{cases} q & \text{if } l = 2^q, \\ h(l - 2^q) & \text{if } l > 2^q, \end{cases}


where q is the largest integer which satisfies 2^q \le l.

4.2 Computation and Communication Costs 


Let n = 2^d, and the total number of computation steps be f(n).
If g(n) is the number of computation steps in stage 1 then a
recurrence relation for f(n) is


f(n) = g(n) + f(n/2)


From our description of the algorithm we have g(n) = n/2,
hence

f(n) = \begin{cases} n/2 + f(n/2) & n > 2, \\ 1 & n = 2. \end{cases} (4.2)

Solving the recurrence (4.2), we have f(n) = n - 1. Therefore,
we have verified the fact that the new parallel algorithm has
achieved the optimal computation cost. The reader should note
that in solving the above recurrence, a geometric progression
corresponding to the stage lengths results.
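The geometric progression mentioned above can be written out explicitly. The following is a worked expansion under the assumption that n = 2^d and that the base case is f(2) = 1, as in the recurrence for f(n):

```latex
f(n) \;=\; \frac{n}{2} + \frac{n}{4} + \cdots + 2 + f(2)
     \;=\; \sum_{k=1}^{d-1} 2^{\,d-k} \;+\; 1
     \;=\; \left(2^{d} - 2\right) + 1
     \;=\; n - 1 .
```

Each term of the sum is the length of one compute-exchange stage, so the d stages together account for exactly n - 1 computation steps.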


To establish that we have achieved the optimal communication
cost consider a stage k consisting of 2^{d-k} computation
steps and 2^{d-k} - 1 communication steps. For n - 1 total computation
steps, d stages are required. The inter-stage divide steps
account for d - 1 of the total. The total number of communication
steps c(n) may be derived from a recurrence relation.


c(n) = \begin{cases} n/2 + c(n/2) & n > 2, \\ 0 & n = 2. \end{cases} (4.3)


Solving (4.3), we obtain c(n) = n - 2. Multiplying by the
number of processors p = n/2 gives the communication cost
C_{DE} for our recursive divide-exchange algorithm.


C_{DE} = (n - 2)p = n(n - 2)/2.


We have achieved the optimum since C_{DE} = C_{min}.


Referring to our example in figure 2, 4 column transactions 
have occurred at each communication step with a total cost of 
6 x 4 = 24 transactions, which is optimal. 
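The communication counts can be checked numerically. The sketch below is illustrative only: the function names are introduced here, and the base case c(2) = 0 is an assumption inferred from the stated solution c(n) = n - 2 rather than read directly from the text.

```python
def comm_steps(n: int) -> int:
    """c(n) from recurrence (4.3); base case c(2) = 0 is an inferred assumption."""
    return 0 if n == 2 else n // 2 + comm_steps(n // 2)


def divide_exchange_cost(n: int) -> int:
    """Column transactions per sweep: c(n) steps on p = n/2 processors."""
    return comm_steps(n) * (n // 2)


def brent_luk_cost(n: int) -> int:
    """(K - 1) * 2p with K - 1 = n - 2 and p = n/2, as given in the text."""
    return (n - 2) * 2 * (n // 2)


if __name__ == "__main__":
    for d in range(1, 6):
        n = 2 ** d
        print(n, comm_steps(n), divide_exchange_cost(n), brent_luk_cost(n))
```

For n = 8 this gives 6 communication steps and 24 transactions, against 48 for the tournament ordering, which matches the factor of two attributed to the double sends and receives.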


5 Mapping onto the Hypercube 


In order to map the recursive divide-exchange algorithm of §4
onto a hypercube architecture we must first specify the operations
performed by each processor in the cube. Given the
two major components of our algorithm, namely a compute-exchange
and a divide, deriving an algorithm for individual
processors is straightforward. Due to the tail recursion in the
parallel SVD algorithm, it may be transformed into an iterative
form.


Algorithm Divide-Exchange


for k = 1 to d do
    for l = 1 to 2^{d-k} - 1 do
        Compute (i, j)
        q = h(l)
        Exchange 2^q
    end
    Compute (i, j)
    Divide 2^{d-k-1}
end


The step “Compute (i, j)” refers to a column update in the
parallel version of Hestenes’ one-sided computation. Using the
terminology introduced in §4, each processor cycles through a
Jacobi sweep consisting of d stages. In the final stage (k = d) the
divide step, which would exchange at a distance of 2^{-1}, is not
carried out. The function h(l) computes the height of an exchange
node X_l, where l is the label number derived by an inorder
traversal of a complete binary tree.


Function h(l)


begin
    q = ⌊log_2 l⌋
    t = l - 2^q
    if t = 0 then
        return q
    else
        return h(t)
    end
end
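To see the iterative form unwound, the sketch below lists the steps one processor performs during a sweep. It is a hedged illustration: `sweep_schedule` and `height` are names introduced here, and `height` uses the lowest set bit of l, which is assumed (not stated in the paper) to coincide with the recursive Function h(l), since that recursion strips leading bits until a power of two remains.

```python
def height(l: int) -> int:
    """Index of the lowest set bit of l; assumed equivalent to Function h(l)."""
    return (l & -l).bit_length() - 1


def sweep_schedule(d: int) -> list[tuple[str, int]]:
    """Steps of one sweep for n = 2**d: ('K', 0), ('X', dist) or ('D', dist)."""
    steps = []
    for k in range(1, d + 1):
        for l in range(1, 2 ** (d - k)):
            steps.append(("K", 0))                    # Compute (i, j)
            steps.append(("X", 1 << height(l)))       # Exchange 2^q, q = h(l)
        steps.append(("K", 0))                        # last compute of the stage
        if k < d:                                     # no divide after the final stage
            steps.append(("D", 2 ** (d - k - 1)))     # Divide 2^(d-k-1)
    return steps


if __name__ == "__main__":
    print("".join(kind for kind, _ in sweep_schedule(3)))
```

For d = 3 the step kinds follow the same pattern as the thirteen-step unwound sequence of the n = 8 example: seven computation steps, four exchanges and two divides.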
The relative ease of mapping a recursive divide-exchange
onto the hypercube is due to the recursive nature of the hypercube
itself. The fact that a hypercube is recursively constructed
out of lower dimensional subcubes may be exploited. A divide
step in our algorithm corresponds to a subdivision of the
problem, allowing computations to proceed on the subcubes.
Exchanges will always consist of communication between pairs
of nearest neighbours on the hypercube. A cube of dimension
d - 1 is required for a problem with n = 2^d. The computation
and communication steps are determined by the exchange
sequence shown in figure 3.


5.2 Processor Pairings 


Nearest neighbour processor pairings on the hypercube may
be determined by a Perfect Shuffle of node addresses. Stone’s
original paper [21] details the generation of such pairings via
a left cyclic shift of the bits in an address. A perfect shuffle
of an N element vector is a permutation P of the indices or
addresses a of the elements such that


P(a) = \begin{cases} 2a & 0 \le a \le N/2 - 1, \\ 2a + 1 - N & N/2 \le a \le N - 1. \end{cases} (5.1)


Consider the binary representation of an integer address for
which N = 2^d. Individual bits at position i are denoted a_i.


a = a_{d-1} 2^{d-1} + a_{d-2} 2^{d-2} + \cdots + a_1 2 + a_0 (5.2)


A perfect shuffle (5.1) of an address a creating a new address
a' corresponds to a left cyclic shift of all bits a_i to a_{i+1}, with
the leftmost bit a_{d-1} wrapped around to a_0 [21].


a' = a_{d-2} 2^{d-1} + a_{d-3} 2^{d-2} + \cdots + a_0 2 + a_{d-1}

Our earlier requirement for a pairwise exchange of columns
at a distance 2^h is easily satisfied, due to the geometry of a
hypercube. The implication is that for addresses of the form
(5.2), a difference in a single bit a_i indicates a distance of 2^i. We
also note that the addresses of neighbouring processors in the
hypercube differ in only one bit position. Exchanges, therefore,
will always be between directly connected neighbours.
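The correspondence between the perfect shuffle and a left cyclic shift of address bits can be checked directly. The sketch below is a small illustration; the function names are introduced here and are not taken from the paper or from [21].

```python
def perfect_shuffle(a: int, n_addresses: int) -> int:
    """Perfect shuffle of address a among N = n_addresses addresses."""
    half = n_addresses // 2
    return 2 * a if a < half else 2 * a + 1 - n_addresses


def rotate_left(a: int, d: int) -> int:
    """Left cyclic shift of a d-bit address; the top bit wraps to bit 0."""
    top = (a >> (d - 1)) & 1
    return ((a << 1) & ((1 << d) - 1)) | top


if __name__ == "__main__":
    d = 3
    for a in range(1 << d):
        shuffled = perfect_shuffle(a, 1 << d)
        print(f"{a:03b} -> {shuffled:03b}")
```

Because every bit moves one position to the left, two addresses that differ only in bit 0 (distance 1) are mapped to addresses that differ only in bit 1 (distance 2), which is the property used to generate pairings at successive powers of 2.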

Processor nodes in a hypercube are labelled from 0 to 2^d - 1,
for example in a 3-dimensional cube there are 8 processors with
addresses 0 to 7. We can use the perfect shuffle to generate
processor pairings required for exchanges at a distance which is
a power of 2. This may be illustrated by an example with d = 3.
Initially processor pairings for exchanges are at a distance of
1. After a perfect shuffle from addresses a to a' exchanges may
take place at a distance of 2, from a' to a'' at a distance of 4
and so on. Processor pairings before and after a perfect shuffle
are given in figure 4.


Figure 4: 3-Dimensional Processor Pairings

The exchange and divide steps required to complete one
sweep of a Jacobi-like algorithm, when n = 2^4 = 16, are illustrated
in figure 5. There are 15 computation steps. The column pairs
(i,j) at each step are written in the processor nodes.
The communication links used between the computation
steps are marked.


Figure 5: Divide-Exchange on a 3-Cube


5.3 Computational Results 


An implementation of Hestenes’ one-sided SVD via the non-recursive
version of our algorithm was written in ‘C’ for subsequent
testing and analysis on the Intel iPSC hypercube. A
simulator for the hypercube was provided by Intel Scientific
Corp. to McGill University for a SUN 3/280 running the BSD
4.3 operating system. This SUN has an IEEE 754 standard
co-processor with a floating point precision of ε = 2.22 × 10^{-16}
in double precision arithmetic.

A threshold Jacobi method, as described in §2, was employed
to ensure proper termination of the algorithm. Following
the methods introduced by Berry et al. in [1] for computation
on an array processor, each node processor in the hypercube
maintains a counter istop. The counter is incremented by a processor
when one of its assigned column pairs (i,j) is deemed to
be orthogonal according to a threshold parameter τ. For the
purposes of our tests we chose τ = ε‖A‖_2.

The parallel computation terminates at the end of a sweep
if each of the n/2 processors reports an istop count of n - 1. For a
series of random 8 x 8 matrices generated using the interactive
matrix software package Matlab, we typically observe convergence
in the hypercube computation after 6 sweeps.

Finally we have observed a communication pattern for random
16 x 16 matrices matching exactly with that shown in
figure 6.
6 Conclusions and Future Research


We have described a new optimal parallel Jacobi-like algorithm 
for the singular value decomposition (SVD). We have demon- 
strated that the new algorithm can be mapped naturally onto 
hypercube architectures, effectively utilizing the nearest neigh- 
bour communication capacity throughout the computation. In 
general, the recursive pairwise exchange communication opera- 
tions of the new algorithm can be efficiently supported by mul- 
tiprocessors with interconnect patterns used in many networks 
that have been proposed to support large-scale parallelism [20]. 
For example, we believe that the new algorithm can be mapped
effectively onto SIMD or MIMD parallel computers with
interconnection networks such as PM2I-based networks and
cube-based networks. These interconnection networks have the
partitionability property: the ability to divide the network into
independent subnetworks of different sizes [20], which matches
the recursive divide-exchange structure of the new parallel
algorithm proposed in this paper.

We suggest the following future research directions: study
extensions of the new algorithm to various forms of the SVD,
to the unsymmetric eigenvalue problem and to the generalized
eigenvalue problem Ax = λBx. Furthermore, we would like to
gather empirical information concerning convergence properties
from numerical simulations.
ABSTRACT


Directed Acyclic Graphs (DAGs) have been extensively used to 
model parallel sparse Gaussian Elimination and its scheduling on 
multiprocessors, even though their use leads to sub-optimal 
schedules. In this paper, task graphs containing directed edges as 
well as undirected edges, called Minimally Constrained Task 
Graphs (MCTGs) are proposed to model parallel sparse 
Gaussian Elimination. An algorithm for scheduling MCTGs on 
multiprocessors is presented and the generated schedule is 
proven optimal. The scheme is evaluated using a number of 
practical matrices arising from circuit simulation and shown to 
be significantly better than scheduling using DAGs. 


1. Introduction


The repeated solution of large sparse linear systems
of equations using Gaussian Elimination (GE), or a variant
thereof, is a computationally intensive component of many
practical applications such as structural analysis and circuit
simulation. Consequently there is considerable interest in
parallelizing the solution of sparse matrices [1-9,11-15]. In
order to identify the maximal degree of potential parallelism 
in sparse GE, it has been customary to view the computation 
at the level of elementary arithmetic operations using a 
directed acyclic graph (DAG). The vertices of such a DAG 
denote elementary arithmetic operations and edges between 
vertices represent execution dependencies between the 
operations. The dependence structure of the DAG is 
determined by a "symbolic" trace of the sequential form of 
the GE algorithm, creating edges from each given vertex to 
all vertices that use the value generated by it [2,5]. It has 
been recognized that the use of such a DAG for scheduling 
the operations of GE on a multiprocessor can result in sub- 
optimal schedules [5], but the resolution of this problem has 
not been previously pursued. 


The problem with the use of a DAG based on 
symbolic unraveling of the sequential GE algorithm, for 
identifying dependence constraints on parallel execution of 
GE, is that the accumulative updates to any matrix element 
are unnecessarily constrained to take place in exactly the 
same order that they would be performed with sequential 
GE. In this paper, we propose the use of task graphs with 
directed edges as well as undirected edges to model parallel 
GE. Directed edges are used only to represent strict temporal 
dependencies, while undirected edges model constraints on 
the non-simultaneity of execution of multiple updates to a 
common matrix element. We refer to these task graphs as 
Minimally Constrained Task Graphs (MCTGs) and present 
an algorithm to schedule such graphs on a shared-memory 
multiprocessor. The optimality of scheduling parallel GE 
using the proposed scheduling algorithm is proved under the
idealized model of a Concurrent Read Exclusive Write
(CREW) multiprocessor with an unbounded number of
processors.


Optimal scheduling of DAGs on an idealized
unbounded multiprocessor is very simply done using
critical-path scheduling. Thus previous studies on parallel GE
under the CREW model have typically focussed on heuristics
for one of the following two (NP-complete) problems: 1)
Given the zero-nonzero structure of a matrix, find a
permutation for the rows/columns of the matrix so that the
task graph (DAG derived from the dependence structure of
the operations constituting sequential GE on the permuted
matrix) has minimum depth (and hence minimal finishing
time) [2,5,12,15]; and 2) Given a specific ordering
(permutation) of rows/columns, find a schedule for a finite
number of processors that minimizes finishing time [13,14].
In this paper, we do not focus on either of the above issues - 
matrix reordering for parallelism or scheduling on a limited 
number of processors. Rather, we focus on the fact that the 
underlying DAG-based model of parallel GE used by earlier 
studies is inherently overconstraining, and we provide an 
approach to avoid this problem using the notion of MCTGs. 
We present this framework under an idealized machine 
model; however the concept of MCTGs has wider 
ramifications and is more appropriate than a DAG-based 
model for the other problem formulations in this context. 


The paper is organized as follows. In section 2, we
use an example from [5] to explain the problem of
suboptimality of scheduling with overconstrained DAGs. In
section 3, we propose the use of MCTGs and provide an
algorithm for scheduling MCTGs on an idealized
multiprocessor. In section 4, we prove the optimality of the
assignment generated by the scheduling algorithm for
MCTGs arising with sparse GE. Section 5 is concerned with
empirical evaluation of the algorithm. Various test matrices
arising from the application domain of circuit simulation are
used in the study. The scheduling algorithm proposed in this
paper results in completion times that are up to forty percent
less than those achieved by the DAG-based algorithm.
Section 6 concludes the paper with a brief discussion.


2. DAG Based Scheduling of Parallel Sparse GE 


We first outline the sparse GE algorithm for the
solution of linear systems of equations. In solving the system
Ax = b, where A is a sparse N×N matrix and b is an N-vector,
values for the N-vector x are sought that satisfy the
simultaneous equations. As is usual, for convenience of
representation of the algorithm, we represent the right hand
side vector b as an additional (N+1)st column of A.


In sparse GE, the order in which the variables are 
eliminated has a significant impact on the number of fill-ins 
(zero elements of A that become non-zero during the 
elimination process) created and hence the total number of 
arithmetic operations. Therefore, the actual elimination is
preceded by an ordering phase in which, based on the zero-nonzero
structure of the matrix, an elimination order which
reduces the number of fill-ins is determined [10]. This
ordering is then used for repeated solution of different sets of 
equations with the same zero-nonzero structure. Also, during 
the ordering phase the actual locations of the fill-ins are 
determined. Thus, in the following when we refer to a non- 
zero element of A it pertains to the filled-in matrix rather 
than the original one. 


/* MATRIX TRIANGULATION */
M1  for k = 1, N
M2      for each j in [k+1, N+1] such that A_kj ≠ 0 do
            A_kj ← A_kj / A_kk
        end for M2
M3      for each i in [k+1, N] such that A_ik ≠ 0 do
M4          for each j in [k+1, N+1] such that A_kj ≠ 0 do
                A_ij ← A_ij - A_ik * A_kj
            end for M4
        end for M3
    end for M1


/* BACK SUBSTITUTION */
B1  for k = N, 1, -1
B2      for each j in [k+1, N] such that A_kj ≠ 0 do
            A_k,N+1 ← A_k,N+1 - A_kj * A_j,N+1
        end for B2
    end for B1


Figure 1. Sparse Gaussian Elimination Algorithm


GE (outlined in figure 1) consists of two steps: 1)
matrix triangulation, and 2) back substitution. Matrix
triangulation may be viewed as a sequence of elementary 
operations - either the division of a matrix element by the 
diagonal element in that row, or an incremental multiply- 
subtract (update) operation on a matrix element. The back- 
substitution phase of GE involves only update operations. 
We focus in our exposition only on the triangulation phase of 
GE, but all our observations relating to the update operations 
in the triangulation phase are directly applicable to the 
update operations of the back-substitution phase also. Thus, 
in the following, we sometimes refer to the triangulation 
phase of GE simply as GE. The scheduling issues for 
variants of Gaussian Elimination, such as LU factorization 
with forward/back substitution are also essentially the same. 


Figure 2 shows an example of a sparse matrix (taken
from [5]) and the sequence of elementary operations that
constitute the triangulation phase of GE for this matrix.¹ This
sequence of operations is obtained by symbolically tracing
the above GE algorithm. A sequential execution of the GE
algorithm involves stepping through this operation list in
order. A parallel implementation of GE will require the same
set of operations to be performed, each operation being
executable as soon as data-dependence constraints are
satisfied. The dependence constraints can be captured by a
graph, as shown in figure 3a. Vertices of this graph represent
the elementary operations of the operation list. A directed
edge is drawn from a vertex to another if the value generated
by the source vertex is utilized by the computation at the
destination vertex. Such a DAG can be used to schedule the
operations of GE on a parallel computer: a vertex (task) can
be scheduled as soon as all the tasks that are represented as
source vertices of its incoming edges have completed
execution. Assuming an idealized shared-memory
multiprocessor with an arbitrary number of processors, where a
divide operation and an update operation each take one unit
of time, the parallel completion time is clearly the length of
the critical path of the DAG.


1. In order to remain consistent with the example used in [5] the operations
on the r.h.s. vector are omitted. However, this does not affect our
presentation.
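The symbolic trace and the resulting critical path can be sketched in a few lines. The code below is an illustration under stated assumptions: the nonzero pattern `PATTERN` is inferred from the operation list of the example (the matrix of figure 2 itself is not reproduced here), the right-hand side is omitted as in the example, and the function names are introduced for this sketch only.

```python
# Assumed nonzero pattern (1-based (row, col)), inferred from the operation list.
PATTERN = [(1, 1), (1, 4), (2, 2), (2, 5), (3, 3), (3, 4), (3, 6), (4, 1),
           (4, 3), (4, 4), (5, 2), (5, 5), (5, 6), (6, 3), (6, 5), (6, 6)]


def symbolic_trace(nonzeros, n):
    """Operation list of the triangulation phase as (kind, dest, factor, source)."""
    nz = set(nonzeros)
    ops = []
    for k in range(1, n + 1):
        cols = [j for j in range(k + 1, n + 1) if (k, j) in nz]
        for j in cols:                                  # normalization of row k
            ops.append(("div", (k, j), (k, k), None))
        for i in range(k + 1, n + 1):
            if (i, k) in nz:
                for j in cols:                          # update of row i
                    nz.add((i, j))                      # records a fill-in if new
                    ops.append(("upd", (i, j), (i, k), (k, j)))
    return ops


def dag_levels(ops):
    """Level of each operation when every operand depends on its last writer."""
    last_writer, levels = {}, []
    for idx, (_, dest, factor, source) in enumerate(ops):
        preds = [last_writer[e] for e in (dest, factor, source) if e in last_writer]
        levels.append(1 + max((levels[p] for p in preds), default=0))
        last_writer[dest] = idx
    return levels


if __name__ == "__main__":
    ops = symbolic_trace(PATTERN, 6)
    print(len(ops), max(dag_levels(ops)))
```

Under these assumptions the trace produces 14 operations and a critical path of length 6; listing operation 12 after operation 14 instead shortens it to 5.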

1. A_14 ← A_14 / A_11
2. A_44 ← A_44 - A_41 × A_14
3. A_25 ← A_25 / A_22
4. A_55 ← A_55 - A_52 × A_25
5. A_34 ← A_34 / A_33
6. A_36 ← A_36 / A_33
7. A_44 ← A_44 - A_43 × A_34
8. A_46 ← A_46 - A_43 × A_36
9. A_64 ← A_64 - A_63 × A_34
10. A_66 ← A_66 - A_63 × A_36
11. A_46 ← A_46 / A_44
12. A_66 ← A_66 - A_64 × A_46
13. A_56 ← A_56 / A_55
14. A_66 ← A_66 - A_65 × A_56


Figure 2. A Sparse Matrix and its Triangulation [5]


(a) DAG derived from operation list shown in figure 2
(b) DAG obtained by reordering operations 12 and 14


Figure 3. Non-Optimality of DAG-Based Scheduling


DAGs have formed the basis for prior studies relating
to the parallel scheduling of GE [2,5,7,8,11-15]. They are
however overly constraining because they require that
multiple updates to a matrix element during a parallel
execution occur in exactly the same order that they would
have taken place if executed sequentially. The order in which
multiple independent updates to a matrix element occur is
clearly irrelevant as long as all of them are completed before
the fully updated matrix element is used as an operand in
some other operation. In the example used, as pointed out by
Huang and Wing, operations 12 and 14 both represent
(independent) update operations on A_66, and can therefore be
executed in any order without affecting the final results
computed. Figure 3b shows the DAG that results from
interchanging these two operations; it has a shorter critical
path compared to the original DAG.


3. Minimally Constrained Task Graphs and their 
Scheduling 


The use of DAGs based on the dependence structure 
of operations of sequential GE is thus overly constraining 
with respect to the update operations in scheduling GE for 
parallel execution. All independent update operations to a 
matrix element should be independently schedulable except 
that no two of them can occur simultaneously. The use of 
directed edges between such update operations thus forces an 
unnecessary and arbitrary precedence constraint, whereas all 
that is really required is a weaker "non-simultaneity" 
constraint. We propose the use of task graphs, called 
Minimally Constrained Task Graphs (MCTGs), where a clear 
distinction is made between strict temporal ordering 
constraints and non-simultaneity constraints. 


MCTGs use both directed edges and undirected
edges. Directed edges are used as before to represent
temporal dependence constraints. Non-simultaneity
constraints are expressed using undirected edges. Two
operations that are prohibited from occurring at exactly the 
same time, but are otherwise executable in either order are 
connected by an undirected edge. The set of update 
operations to a matrix element in GE can be performed in 
any order and will produce the same final result (except for 
round-off errors) due to the commutativity and associativity 
of the addition. Thus all update operations on any matrix 
element form a clique of vertices in the MCTG, connected 
among themselves by undirected edges, as shown in figure 4 
for the same example as before. Directed edges are used 
between each update operation and succeeding operations 
that use the final updated value. As a matter of terminology, 
we refer to nodes that are connected by undirected edges as 
sibling nodes and the relationship represented by the edge as 
a sibling relationship. The notation (i,j) is used to denote 
the undirected edge between nodes i and j. A directed edge 
from node n to node m is denoted by <n,m>. The node n is 
referred to as a parent of the node m, while the latter is 
called a child of the former. 


Figure 4. Minimally Constrained Task Graph for Example 


We now present the formal definition of GE MCTGs. 
Given a sparse matrix and the sequence of operations that 
constitute its triangulation (see figure 2 for an example): 
For each update operation i, d_i, f_i and s_i denote the
destination, factor and source elements respectively, i.e.,
d_i ← d_i - f_i × s_i. By definition, d_i and f_i belong to the same
row of the matrix. For each normalization operation i, d_i and
f_i denote the destination and factor elements respectively,
i.e., d_i ← d_i / f_i. For each operation i, r_i denotes the position
of i in the operation list. Further,²


D(i) = \arg\max_{\{j \,:\, r_j < r_i,\ d_j = d_i\}} r_j

and,

F(i) = \arg\max_{\{j \,:\, r_j < r_i,\ d_j = f_i\}} r_j


In addition, for update operations we similarly define


S(i) = \arg\max_{\{j \,:\, r_j < r_i,\ d_j = s_i\}} r_j


We now define the GE MCTG as follows: 


For any pair of update operations i, j, the undirected edge
(i,j) exists if and only if:

    d_i = d_j

For each update operation i, the directed edge <j,i> exists if
and only if:

    1. j = S(i) or
    2. j = F(i) or
    3. (j, F(i)) exists

For each normalization operation i, the directed edge <j,i>
exists if and only if:

    1. j = D(i) or
    2. (j, D(i)) exists or
    3. j = F(i) or
    4. (j, F(i)) exists


Note from the definition of undirected edges in the GE 
MCTG that they form cliques between tasks that constitute 
the updates of a common matrix element. We refer to this 
feature as the clique property. Regarding directed edges, first
note that for any update operation i, S(i) exists, and is a
normalization operation. Hence, if a task has a sibling, it
must have a parent. Further, S(i) does not belong to a clique.
Therefore, the child of any member of a clique is a child of 
every member of that clique. We refer to this last feature as 
the common children property. 
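The edge rules can be exercised on the running example. The sketch below is illustrative and rests on assumptions: `OPS` is the 14-operation list of the example written as (kind, destination, factor, source) tuples, D(i), F(i) and S(i) are read as "the latest earlier operation that writes the destination, factor or source element of i", and all names are introduced here.

```python
OPS = [
    ("div", (1, 4), (1, 1), None), ("upd", (4, 4), (4, 1), (1, 4)),
    ("div", (2, 5), (2, 2), None), ("upd", (5, 5), (5, 2), (2, 5)),
    ("div", (3, 4), (3, 3), None), ("div", (3, 6), (3, 3), None),
    ("upd", (4, 4), (4, 3), (3, 4)), ("upd", (4, 6), (4, 3), (3, 6)),
    ("upd", (6, 4), (6, 3), (3, 4)), ("upd", (6, 6), (6, 3), (3, 6)),
    ("div", (4, 6), (4, 4), None), ("upd", (6, 6), (6, 4), (4, 6)),
    ("div", (5, 6), (5, 5), None), ("upd", (6, 6), (6, 5), (5, 6)),
]


def last_writer(ops, i, element):
    """Latest operation before position i whose destination is `element`."""
    for j in range(i - 1, -1, -1):
        if ops[j][1] == element:
            return j
    return None


def build_mctg(ops):
    """Return (directed, undirected) edge sets over 0-based operation indices."""
    updates = [i for i, op in enumerate(ops) if op[0] == "upd"]
    undirected = {(i, j) for i in updates for j in updates
                  if i < j and ops[i][1] == ops[j][1]}

    def with_siblings(j):
        if j is None:
            return set()
        return {j} | {a if b == j else b for a, b in undirected if j in (a, b)}

    directed = set()
    for i, (kind, dest, factor, source) in enumerate(ops):
        preds = with_siblings(last_writer(ops, i, factor))       # F(i) and siblings
        if kind == "upd":
            s = last_writer(ops, i, source)                      # S(i)
            if s is not None:
                preds.add(s)
        else:
            preds |= with_siblings(last_writer(ops, i, dest))    # D(i) and siblings
        directed |= {(j, i) for j in preds}
    return directed, undirected


if __name__ == "__main__":
    directed, undirected = build_mctg(OPS)
    print(sorted(undirected))
    print(sorted(directed))
```

Under this reading the three updates of A_66 form one clique and the two updates of A_44 another, and operations 12 and 14 (indices 11 and 13) are joined only by an undirected edge, so neither is forced to precede the other.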


For purposes of comparison, we now use the above 
terminology to define the GE DAG used by Huang and
Wing [5] and subsequent researchers.


In a GE DAG: 


For each update operation i, the directed edge <j,i> exists if
and only if:

    1. j = D(i) or
    2. j = F(i) or
    3. j = S(i)

For each normalization operation i, the directed edge <j,i>
exists if and only if:

    1. j = D(i) or
    2. j = F(i)

Thus, the interpretation of directed edges in MCTGs
is slightly different from the interpretation when using
DAGs. With DAGs, an edge represented a temporal
constraint in that a data result produced by a parent operation
was needed and directly used by a child operation in the
DAG. With MCTGs, a directed edge again represents a
temporal execution constraint in that the parent operation
necessarily has to be completed before the child operation
can be executed. However, the value produced by the
execution of the parent operation is not necessarily directly
used as an operand for the child operation. This
interpretation of a directed edge permits the necessary
flexibility in scheduling the multiple updates of a matrix
element to optimize GE completion time. Thus, unlike GE
DAGs that are irredundant [14], an MCTG is not an
irredundant graph; but as can be seen in the next section, this
poses no problems in its optimal scheduling.


2. arg max f(x) is the value of x in X at which the maximum value of f(x) is
attained.


We now present an algorithm for scheduling MCTGs 
on an idealized CREW multiprocessor. As has typically been 
assumed in prior treatments on scheduling parallel GE 
[2,13,14], we consider an update operation and a divide 
operation to take the same (unit) amount of time. The 
algorithm however can be trivially generalized to handle 
non-uniform execution times for the various operations. The 
scheduling problem may be viewed as that of the assignment 
of positive integer level numbers to the nodes of the MCTG 
so that: 

1) each node has a level number that is higher than that of 
any of its parent nodes (if any), 

2) no two sibling nodes are assigned the same level, and 

3) the highest assigned level number is as small as possible. 


/* ALA : ALGORITHM for LEVEL ASSIGNMENT */


/* initialization */
Z ← ∅ ;
for each root node m of G do
    L_m ← 1 ;
    Z ← Z ∪ {m} ;
end for
for each non-root node n of G do
    P_n ← Number of parent nodes of n ;
    E_n ← 1 ;
    F_n ← false ;
end for


/* main */
while Z is not empty do
    Remove a node n from Z ;
    for each child m of n in G do
        E_m ← max(E_m, L_n + 1) ;
        P_m ← P_m - 1 ;
        if P_m = 0 then
            L_m ← E_m ;
            while any sibling t of m has
                  (F_t = true and L_t = L_m) do
                L_m ← L_m + 1 ;
            F_m ← true ;
            Z ← Z ∪ {m} ;
        end if
    end for
end while


Figure 5. Level Assignment Algorithm
Assuming unit execution times for the operations of
the MCTG, the level number assigned to a node corresponds
to the earliest time at which that operation can be scheduled
for execution on an idealized CREW multiprocessor. The
level assignment algorithm shown in figure 5 associates a
quadruple (P,L,E,F) with each node of the MCTG G. P_n is a
counter associated with node n that is initialized to the
number of parent nodes of n in G. L_n is the level number
assigned to n. E_n represents the earliest possible level
assignable to n, based on directed-edge constraints; it is
initialized to 1, and successively modified as the algorithm is
executed. F_n is a flag associated with node n, to keep track
of whether or not its level assignment has been finalized yet.


The algorithm essentially traverses the directed edges
of G, ensuring that temporal dependence constraints are
satisfied. Each directed edge <n,m> is only traversed once,
and only after the source node n has been assigned a level
number by the algorithm. All root nodes (nodes without any
incoming directed edges or undirected sibling edges) are
initially assigned a level number of 1 and placed into an
operating set Z. This set Z is used to temporarily maintain
nodes whose level numbers have been finalized, until all
outwardly directed edges from them have been traversed by
the algorithm. As each edge <n,m> is traversed, E_m is
updated to be at least L_n + 1, if it is not already so. P_m, the
counter associated with node m, is decremented by one. If the
currently traversed edge <n,m> is the last incoming edge to
m to be traversed, then P_m becomes zero, and node m is
assigned its finalized level number L_m. The earliest level it is
schedulable at is its current value of E_m, provided that none
of its siblings (if any) has already been assigned at that level.
If E_m is prohibited for node m due to sibling conflict, then
E_m + 1 is tried, and so on until the lowest conflict-free level
is determined and assigned to L_m. F_m is now set true to mark
the assignment of a level to node m, and m is added to the
operating set Z.
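A compact executable rendering of the level assignment may help in following the description. The sketch below is not the authors' implementation: the function name, the queue-based selection order, and the edge lists (written out by hand for the 14-operation example, using 1-based operation numbers and the edge rules as understood here) are all assumptions made for illustration.

```python
from collections import deque

# Hand-derived edges for the running example (an assumption, see lead-in).
DIRECTED = [(1, 2), (3, 4), (5, 7), (6, 8), (5, 9), (6, 10), (7, 11), (2, 11),
            (8, 11), (9, 12), (11, 12), (4, 13), (13, 14)]
UNDIRECTED = [(2, 7), (10, 12), (10, 14), (12, 14)]


def assign_levels(nodes, directed, undirected):
    """Greedy level assignment following ALA; returns {node: level}."""
    parents = {v: set() for v in nodes}
    children = {v: set() for v in nodes}
    siblings = {v: set() for v in nodes}
    for n, m in directed:
        parents[m].add(n)
        children[n].add(m)
    for a, b in undirected:
        siblings[a].add(b)
        siblings[b].add(a)

    level = {}                                   # finalized levels (F = true)
    earliest = {v: 1 for v in nodes}             # E
    pending = {v: len(parents[v]) for v in nodes}  # P
    work = deque()                               # operating set Z
    for v in nodes:
        if not parents[v] and not siblings[v]:   # root node
            level[v] = 1
            work.append(v)
    while work:
        n = work.popleft()
        for m in sorted(children[n]):
            earliest[m] = max(earliest[m], level[n] + 1)
            pending[m] -= 1
            if pending[m] == 0:                  # last incoming edge traversed
                lv = earliest[m]
                taken = {level[t] for t in siblings[m] if t in level}
                while lv in taken:               # resolve sibling conflicts
                    lv += 1
                level[m] = lv
                work.append(m)
    return level


if __name__ == "__main__":
    levels = assign_levels(range(1, 15), DIRECTED, UNDIRECTED)
    print(max(levels.values()), levels)
```

With these edges the highest assigned level is 5, one less than the critical path of the corresponding DAG; a node that has siblings but no parent would never be reached by this sketch, which is consistent with the observation that such nodes do not occur in GE MCTGs.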


For graphs without undirected edges, the above 
algorithm reduces to the conventional DAG critical-path 
scheduling algorithm, and will clearly produce a unique, 
optimal levelization irrespective of the order of selection of 
nodes from the operating set Z for edge traversal. However, 
in general, when undirected edges are present, different 
orders of selection of nodes out of Z and of traversing the 
directed edges emanating from them can result in different 
schedules. This is illustrated by the examples in figure 6. In
figure 6a, after node 1 (the only root node) is assigned level
number one, its outgoing edges could be traversed in any
order. If <1,2> is traversed before <1,3>, the levelization in
(ii) results, but if child node 3 is selected before node 2, then
the different levelization shown in (iii) is the outcome.


Nevertheless, in the case of GE MCTGs, it can be
shown that all possible schedules, produced by various
selection orders, have the same (optimal) finishing time. This
is a consequence of the Clique and Common Children
properties of GE MCTGs, which guarantee the optimality of the
"greedy", "on-the-fly" approach to sibling conflict resolution
adopted by ALA. Consider a set of nodes of G that form
a clique. If these nodes have distinct earliest-schedulable-times,
then they will each be so scheduled, leading to
optimal scheduling. If some of these earliest-schedulable-times
coincide at a value, say $l$, the fact that these nodes
form a clique guarantees that no matter which node is visited
first and assigned the level $l$, there will be a conflict of the
same number of clique sibling nodes at level $l+1$. As a
result, the maximum of the levels assigned to the members
of the clique will be the same (and optimal) independent of
the specific level that is assigned to each of them. Due to
the common-children property, any child of a clique node is
also a child of all other nodes of that clique. Thus, the
earliest-schedulable-time of the child of a clique is the same,
independent of the order in which its parents in the clique
were visited. These two properties therefore result in the
fact that the ALA schedule for GE MCTGs is optimal.


[Figure 6. Selection-Order Dependence of Schedule.
(a) Example where Common-Children property is not satisfied by MCTG
(panels: MCTG; level assignment if 2 is selected before 3; level
assignment if 2 is selected after 3).
(b) Example where Clique property is not satisfied by MCTG
(panels: MCTG; level assignment if 4 is selected before 3; level
assignment if 4 is selected after 3).]


The examples in figure 6 demonstrate the necessity of 
the clique property and the common-children property. The 
example of figure 6a violates the common-children property, 
and the two different selection orders shown result in 
different schedules, one of which is suboptimal. Figure 6b 
demonstrates the same point with respect to the clique 
property. In the following section, we first prove the 
optimality of the schedule generated by ALA for GE 
MCTGs. We then prove that any schedule that satisfies the 
constraints of the GE DAG for a matrix also satisfies the 
constraints of the corresponding GE MCTG. These two 
results imply that for any matrix, the ALA scheduling of the 
MCTG will result in a finishing time that is less than or 
equal to that produced by critical path scheduling of the 
corresponding DAG. The empirical results reported in
section 5 show that the reduction in finishing time can be as
much as forty percent for matrices arising in circuit
simulation.


4. Optimality of the Algorithm for Level Assignment 


As has been discussed in the previous section, GE
MCTGs have the following three properties, which will be
used to prove the optimality of ALA for scheduling GE
MCTGs.


Property 1: Sibling nodes form cliques. ∎


Property 2: Any two sibling nodes have the same set of
children nodes. ∎


Property 3: If a node has a sibling, then it must have a
parent. ∎


4.1 Preliminaries

We begin by introducing some notation.


$V$ = set of all nodes in the MCTG
$S = \{ n \in V \mid n \text{ has no sibling} \}$ (solitary nodes of $V$)
$R = \{ n \in V \mid n \text{ has no parent} \}$ (root nodes of $V$)


Property 3 can now be restated as:

$$ R \subseteq S \tag{1} $$


For a node m,

$P_m = \{ n \in V \mid n \text{ is a parent of } m \}$ (parents of m)
$S_m = P_m \cap S$ (solitary parents of m)


For GE MCTGs, since sibling nodes form cliques, we have
the following additional notation:


$C$ = family of all cliques in the MCTG (of cardinality $\ge 2$)

For a node m,

$\mathcal{Q}_m = \{ I \in C \mid I \cap P_m \ne \emptyset \}$ (clique parents of m)

and,

$$ Q_m = \bigcup_{I \in \mathcal{Q}_m} I $$


Using the above notation, we can restate Property 2 as:

$$ P_m = S_m \cup Q_m \quad \forall m \in V \tag{2} $$


We now introduce some definitions and notation
pertaining to level assignment algorithms. We use $L_i$ to
denote the level assigned to node $i$.


Definition 1: Given an MCTG G, a valid assignment is an
assignment of levels (natural numbers) to the nodes of G
such that:

1) If <i,j> is a directed edge of G, $L_i < L_j$, and

2) If (i,j) is an undirected edge in G, $L_i \ne L_j$. ∎
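
Definition 1 translates directly into a small checker. The following Python sketch is illustrative only; the function name `is_valid_assignment` and the dictionary/edge-list representation are assumptions introduced here, not part of the original presentation.

```python
def is_valid_assignment(level, directed, undirected):
    """Check the two conditions of Definition 1.

    level      -- dict mapping node -> assigned level (natural number)
    directed   -- list of (i, j) edges requiring level[i] < level[j]
    undirected -- list of (i, j) edges requiring level[i] != level[j]
    """
    precedence_ok = all(level[i] < level[j] for i, j in directed)
    exclusion_ok = all(level[i] != level[j] for i, j in undirected)
    return precedence_ok and exclusion_ok
```

For instance, with directed edges (1,2), (1,3), (2,4), (3,4) and the undirected edge (2,3), the assignment {1: 1, 2: 2, 3: 3, 4: 4} is valid, whereas placing nodes 2 and 3 on the same level is not.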


For a clique $I$, let

$$ C_I = \max_{i \in I} L_i $$

that is, $C_I$ denotes the completion time of clique $I$. It follows
from the above definitions that

$$ \max_{I \in \mathcal{Q}_m} C_I = \max_{i \in Q_m} L_i \quad \forall m \in V \tag{3} $$

The minimum possible level that can be assigned to the
solitary node m by any valid assignment is denoted by $\hat{L}_m$,
while the minimum possible completion time that can be
assigned to the clique $I$ by any valid assignment is denoted
by $\hat{C}_I$.


Definition 2: A valid assignment (that assigns levels $L_i$) is
said to be optimal if

$$ L_i = \hat{L}_i \quad \forall i \in S $$
$$ C_I = \hat{C}_I \quad \forall I \in C $$ ∎

We now prove a simple result pertaining to optimal 
assignments. 


Lemma 1: For all $m \in S-R$,

$$ \hat{L}_m = \max \left( \max_{i \in S_m} \hat{L}_i,\ \max_{I \in \mathcal{Q}_m} \hat{C}_I \right) + 1 $$


Proof : Consider any assignment (which assigns levels $L_i$)
that is optimal. Due to its optimality,

$$ L_m = \max_{i \in P_m} L_i + 1 $$

It now follows from (2) and (3) that

$$ L_m = \max \left( \max_{i \in S_m} L_i,\ \max_{I \in \mathcal{Q}_m} C_I \right) + 1 $$

The lemma now follows from Definition 2. ∎


4.2 Clique Resolution Procedure 


The key step in proving the optimality of ALA is 
showing the optimality of the resolution of sibling conflicts 
in the algorithm. We therefore consider an abstract clique 
resolution procedure in this section, prove its optimality and 
then use this result in the following subsection to prove the 
optimality of ALA. 


Problem $R(I,E_i)$: Given a set $I$ and $E_i \in N$ (the set of
natural numbers), associated with each $i \in I$, assign $L_i \in N$
such that

1. $L_i \ge E_i \quad \forall i \in I$
2. $L_i \ne L_k \quad \forall i,k \in I,\ i \ne k$


Any assignment of values for $L_i$ that satisfies the above two
requirements is called a valid resolution for $R(I,E_i)$, while
one that minimizes $\max_{i \in I} L_i$ over all possible valid resolutions
is called an optimal resolution for $R(I,E_i)$. As we shall
prove in the following, the procedure outlined below results
in an optimal resolution for $R(I,E_i)$.


/* CRP : Clique Resolution Procedure */
for all i ∈ I do
    F_i ← false ;
    L_i ← E_i ;
end for
while there exists i ∈ I with F_i = false do
    while there exists m ∈ I with F_m = true and L_m = L_i do
        L_i ← L_i + 1 ;
    end while
    F_i ← true ;
end while


Note that as stated, CRP does not specify a precise sequence
in which the $F_i$'s are set to true. We therefore have the
following definition.


Definition 3: A CRP Selection Order is the sequence in
which the $F_i$'s are set to true by a particular instance of
CRP. ∎
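
As an illustration of CRP and of the role of the selection order, the following Python sketch resolves one clique for a given order and then enumerates every order by brute force. The names `crp` and `completion_times` and the sample values of $E_i$ are hypothetical; the sketch only illustrates the procedure on a small instance and is not a substitute for the proof.

```python
from itertools import permutations


def crp(earliest, order):
    """Resolve one clique.

    earliest -- dict mapping element i -> E_i
    order    -- sequence giving the CRP Selection Order
    Returns a dict mapping i -> L_i.
    """
    level = {}
    for i in order:
        candidate = earliest[i]
        taken = set(level.values())      # levels of elements with F = true
        while candidate in taken:
            candidate += 1
        level[i] = candidate             # F_i becomes true here
    return level


def completion_times(earliest):
    """Set of max-level values reached over all selection orders."""
    return {
        max(crp(earliest, order).values())
        for order in permutations(earliest)
    }


# Hypothetical instance: two elements ready at level 2, two at level 3.
sample = {"a": 2, "b": 2, "c": 3, "d": 3}
distinct_completions = completion_times(sample)
```

For this instance every one of the 24 selection orders fills levels 2 through 5, so `distinct_completions` contains the single value 5, although the level given to an individual element depends on the order.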


For an assignment corresponding to a selection order, we
define

$$ C = \max_{i \in I} L_i $$

$$ T_j = \begin{cases} 1 & \text{if } \exists k \in I,\ L_k = j \\ 0 & \text{otherwise} \end{cases} $$

It follows from the definition of $T_j$ that

$$ C = \max \{ j \mid T_j = 1 \} $$

Observe that for any selection order, when an element i is
selected for the assignment of a value to $L_i$, the value that is
assigned is the smallest available integer that is greater than
or equal to $E_i$. We formalize this fact in the following
lemma.


Lemma 2: For any CRP Selection Order, if $T_j = 0$, then

$$ L_i > j \Rightarrow E_i > j \quad \forall i \in I $$ ∎

We now present the main result of this subsection.


Theorem 1: Every CRP Selection Order results in an optimal
resolution for $R(I,E_i)$.


Proof : Let $O$ (that assigns values $L_i$) denote the "worst"
CRP Selection Order, that is,

$$ C = \max_{i \in I} L_i = \max_{\text{all CRP selection orders}} C $$

Let

$$ T_j = \begin{cases} 1 & \text{if } \exists i \in I,\ L_i = j \\ 0 & \text{otherwise} \end{cases} $$

Suppose that the proposition of the theorem is false. Then
there exists a valid resolution (called $\bar{O}$) which assigns levels
$\bar{L}_i$ such that

$$ \bar{C} = \max_{i \in I} \bar{L}_i < C \tag{4} $$

Let

$$ \bar{T}_j = \begin{cases} 1 & \text{if } \exists i \in I,\ \bar{L}_i = j \\ 0 & \text{otherwise} \end{cases} $$

Since any valid resolution assigns a unique value $L_i$ for each
$i \in I$ (requirement 2 of problem $R(I,E_i)$), it follows from (4)
that there exists at least one level $l < C$ at which $\bar{O}$ has
assigned an element but $O$ has not, that is,

$$ T_l = 0 \text{ and } \bar{T}_l = 1 $$

Let $l^*$ be the largest such $l$. For all $l > l^*$, one and only one
of the following is true:

$$ T_l = \bar{T}_l = 1 \tag{5a} $$
$$ T_l = \bar{T}_l = 0 \tag{5b} $$
$$ T_l = 1 \text{ and } \bar{T}_l = 0 \tag{5c} $$

Let

$$ M = |\{ i \in I \mid L_i > l^* \}| $$
$$ \bar{M} = |\{ i \in I \mid \bar{L}_i > l^* \}| $$

From (4) and (5),

$$ M > \bar{M} \tag{6} $$

Since $T_{l^*} = 0$, it follows from Lemma 2 and (6) that

$$ |\{ i \in I \mid E_i > l^* \}| \ge M > \bar{M} $$

which contradicts the assumption that $\bar{O}$ satisfies requirement
1 of problem $R(I,E_i)$. Hence the theorem. ∎

4.3 Proof of Optimality 


We return now to the proof of the assertion that
ALA results in the minimum possible number of levels.
Recall from Section 3 that the levels assigned by ALA are
denoted by $L_i$. Further,

$$ L_i \ge E_i \quad \forall i \in V-R \tag{7} $$

and

$$ L_i = E_i \quad \forall i \in S-R \tag{8} $$

For a clique $I$, let

$$ C_I = \max_{i \in I} L_i $$

that is, $C_I$ is the completion time assigned to clique $I$ by
ALA.


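Lemma 3: For all $m \in V-R$,

$$ L_i = \hat{L}_i \ \forall i \in S_m \text{ and } C_I = \hat{C}_I \ \forall I \in \mathcal{Q}_m $$
$$ \Rightarrow E_m = \max \left( \max_{i \in S_m} \hat{L}_i,\ \max_{I \in \mathcal{Q}_m} \hat{C}_I \right) + 1 $$

Proof : Let $m \in V-R$ be arbitrary. By the construction of the
algorithm,

$$ E_m = \max_{i \in P_m} L_i + 1 $$

But due to (2) and (3),

$$ E_m = \max \left( \max_{i \in S_m} L_i,\ \max_{I \in \mathcal{Q}_m} C_I \right) + 1 $$

The lemma follows by substituting the hypotheses. ∎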


Lemma 4: For all $I \in C$,

$$ L_j = \hat{L}_j \ \forall j \in \bigcup_{i \in I} S_i \text{ and } C_J = \hat{C}_J \ \forall J \in \bigcup_{i \in I} \mathcal{Q}_i \ \Rightarrow\ C_I = \hat{C}_I $$

Proof : Let A, which assigns levels $\tilde{L}_i$, be an optimal
assignment. Let $I \in C$ be arbitrary. Since A is optimal, for
all $i \in I$,

$$ \tilde{L}_j = \hat{L}_j \quad \forall j \in S_i $$
$$ \tilde{C}_J = \hat{C}_J \quad \forall J \in \mathcal{Q}_i $$

It now follows from the assumptions of the lemma, Lemma
3, and the fact that A is a valid assignment that

$$ \tilde{L}_i \ge E_i \quad \forall i \in I $$
$$ \tilde{L}_i \ne \tilde{L}_j \quad \forall i,j \in I,\ i \ne j $$

In other words, the assignment produced by A for the nodes
in $I$ is a valid resolution for $R(I,E_i)$. Now, in ALA, different
selection orders for the nodes in the operating set Z and the
directed edges to their children will result in different
sequences in which the $F_i$'s are set to true. However, each
one of these sequences corresponds to a CRP Selection Order
for $R(I,E_i)$, which by Theorem 1 is an optimal resolution.
But since we have already shown that the optimal assignment
A is a valid resolution for $R(I,E_i)$, it follows that $C_I = \hat{C}_I$. ∎


Theorem 2: ALA results in the minimum possible number of 
levels. 


Proof : Note first that due to (1) and the construction of the
operating set Z in the initialization phase of ALA,

$$ L_i = 1 = \hat{L}_i \quad \forall i \in R \tag{9} $$

From Lemmas 1 and 3 and (8),

$$ L_i = \hat{L}_i \ \forall i \in S_m \text{ and } C_I = \hat{C}_I \ \forall I \in \mathcal{Q}_m \ \Rightarrow\ L_m = \hat{L}_m \quad \forall m \in S-R \tag{10} $$

Now consider a graph derived from the MCTG in which
each clique is collapsed into a single node. It follows from
[14, Lemma 1] that this graph is a DAG. Hence from (9),
(10) and Lemma 4, it follows via induction that ALA is an
optimal assignment, which implies that it results in the
minimum possible number of levels. ∎


We conclude this section by proving that for any 
matrix, the number of levels generated by any level 
assignment procedure that satisfies the constraints of the 
corresponding GE DAG (defined in section 3) will be greater 
than or equal to the number resulting from an application of
ALA to the corresponding MCTG. This is done by showing 
that any valid assignment for the DAG is a valid assignment 
for the MCTG. 


Definition 4: A level assignment A (that assigns level $L_k$ to
node k) is said to be valid for a given DAG if:


<i,j> is a directed edge of the DAG $\Rightarrow L_i < L_j$ ∎


Theorem 3: A level assignment that is valid for the GE DAG 
corresponding to a matrix is a valid assignment for the 
corresponding GE MCTG. 




either case A cannot assign the same level to both 1 and m, 


edges of the MCTG. 


Let <j,i> be an arbitrary directed edge in the MCTG. 
If <j,i> exists in the DAG then clearly A satisfies the 
constraint imposed by it. If <j,i> does not exist in the DAG, 
it follows from the definition of the DAG and the MCTG 
(c.f. section 3) that j belongs to a sibling clique in the 
MCTG. Suppose that j belongs to the same clique as F(i).
Since in the DAG there is a chain of edges from j to i that
goes through F(i), A will assign a level to j that is less than
the level it assigns to i. The same argument holds if i is a
normalization operation and j belongs to the same clique as
D(i). Hence A satisfies all the constraints imposed by the
directed edges of the MCTG. ∎


5. Empirical Evaluation and Discussion 


The MCTG scheduling algorithm ALA was
evaluated empirically using matrices deriving from the
application domain of circuit simulation. The three examples 
used arose in the simulation of portions of a Digital Signal 
Processor, a Digital-to-Analog Converter, and a Memory 
circuit respectively. The matrices were first reordered using 
the Markowitz ordering scheme [10]. The elementary 
arithmetic operations for GE under this ordering were 
generated and scheduled using a) the conventional DAG- 
based scheme, where the dependencies were determined 
using a symbolic trace of the sequential GE algorithm, and 
b) using the MCTG scheduling algorithm ALA presented in 
Sec. 3. 


Table 1 lists some of the characteristics of the test 
matrices used and presents the results obtained for scheduling 
the triangulation phase of GE. We report the number of 
levels in the generated schedules and the average number of 
operations per level. The former represents the finishing time
using an idealized CREW multiprocessor while the latter is
representative of the average degree of parallelism
exploitable in GE triangulation. The MCTG-based schedule
can be seen to provide a 23% to 39% improvement over
the DAG-based approach for the examples considered.
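
The quoted improvement range can be recomputed from the level counts of Table 1. The short Python sketch below assumes that, of the two level counts listed for each example, the larger belongs to the DAG-based schedule and the smaller to ALA; the names `TABLE_1_LEVELS` and `reduction_percent` are introduced here for illustration only.

```python
# (DAG-based levels, ALA levels) per example; column order is an assumption.
TABLE_1_LEVELS = {"DSP": (30, 23), "DAC": (49, 30), "MEM": (109, 77)}


def reduction_percent(dag_levels, ala_levels):
    """Relative reduction in the number of levels, in percent."""
    return 100.0 * (dag_levels - ala_levels) / dag_levels


for name, (dag, ala) in TABLE_1_LEVELS.items():
    print(f"{name}: {reduction_percent(dag, ala):.1f}% fewer levels")
```

Under this reading the reductions are roughly 23% (DSP), 39% (DAC) and 29% (MEM), which agrees with the range stated in the text.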


TABLE 1. Comparison of DAG-Based Levelization and ALA

                                 No. of Levels
Example                        DAG-based     ALA
DSP           93       812         30         23
DAC          147       882         49         30
MEM          587      7048        109         77


The reduction in finishing time obtained with the
MCTG-based scheduling scheme is comparable to that
reported in the literature for matrix reordering schemes
targeted at increasing parallelism in GE. Further, the decrease
in finishing time obtained by the various proposed reordering
heuristics is, unlike the MCTG-based approach, typically at
the expense of an increased total number of arithmetic
operations [2,3,12]. An interesting open question is whether
the use of the MCTG-based scheduling scheme will provide
comparable improvements in the number of levels with these
matrix reordering schemes targeted at increasing parallelism,
as it has for schedules based on the Markowitz ordering
scheme. In any case, comparisons of different matrix
reordering schemes with respect to the degree of parallelism
exploitable should be based on the less constrained MCTG-based
schedule rather than the conventionally used DAG-based
schedule.


The MCTG-based approach also has implications for
the scheduling of practical finite-processor parallel systems.
One approach to the parallel execution of GE on a
multiprocessor is to use barrier synchronization between
levels in the schedule. The greater average degree of
parallelism obtainable with an MCTG-based schedule than
with the conventional DAG-based schedule implies better
load-balancing on a multiprocessor with the former schedule.
Further, since there are fewer levels with the MCTG-based
schedule, fewer barrier synchronization points are required.
Thus less overhead can be expected with an MCTG-based
schedule, even though the actual performance improvement
achieved will depend significantly on various machine
characteristics and implementation dependent factors. It can
be expected, though, that the use of the inherently less
constrained MCTG model, in conjunction with an appropriate
characterization of the performance of a practical
multiprocessor, can lead to more effective scheduling
schemes than the use of the conventional DAG-based model.


6. Conclusions 


In sum, we have presented a new approach to 
modeling task graphs for scheduling on a shared-memory 
multiprocessor. The key idea is the use of undirected edges 
to connect tasks (such as the additive updates of a matrix 
location) that cannot be done simultaneously but can be 
executed in either order. Such a task graph is inherently less 
constraining than one in which only directed edges are used. 


We have developed a scheduling algorithm for these 
Minimally Constrained Task Graphs, and showed the 
algorithm to be optimal for Gaussian Elimination task graphs 
under an idealized Concurrent Read Exclusive Write 
multiprocessor model. We have empirically evaluated the
proposed scheduling algorithm using sparse matrices derived
from circuit simulation of sample electronic circuits, and
showed it to provide up to forty percent improvement over
the conventional approach.


Even though an idealized machine model has been 
used in this paper to present the Minimally Constrained Task 
Graph approach, the approach holds promise in the context 
of scheduling computations on real multiprocessors. The 
refinements required to accommodate characteristics of 
practical finite-processor systems for their effective 
scheduling are open questions for future research. 
Abstract


Task graphs of parallel algorithms which are based on 
the divide-and-conquer strategy often exhibit a characteristic 
structure known as the partitioning structure. We present 
some new methods for bounding and approximating the mean 
execution time of a partitioning structure when the execution 
times for the tasks are non-deterministic and compare them 
with previous approaches. Distribution-driven simulation 
results show that two of the methods, the iterative approximation
and the independent paths approximation, provide accurate
estimates, usually to within 10 percent. Results from
program-driven simulation of a parallel quicksort algorithm
running on the Rice Parallel Processing Testbed indicate that
the methods give good estimates even when certain independence
assumptions are violated. The independent paths
approximation is used to derive an analytical expression for
the mean execution time of a parallel mergesort algorithm.


1. Introduction 


A common approach to solving problems is to partition 
the problem into smaller parts, find solutions for the parts, 
and then combine the solutions for the parts into a solution 
for the whole. This divide-and-conquer strategy, applied 
recursively, is the basis for several classes of parallel algo- 
rithms, including a number of sorting and searching algo- 
rithms. These algorithms typically consist of three phases: a 
divide phase during which work is partitioned, a work phase 
during which computation is performed on the partitions, and 
a merge phase during which results from the previous steps 
are combined. Task graphs of such algorithms have a charac- 
teristic structure known as the partitioning structure [1]. A 
classic example of such an algorithm is the quicksort algo- 
rithm which partitions an array of elements to be sorted into 
two subarrays, each of which is subdivided recursively until 
the number of elements in a subarray is below a threshold. 
The work phase consists of sorting the elements in the subar- 
ray. The merge phase is either non-existent (if the partition- 
ing and sorting are done in place), or trivial (if the partition- 
ing and sorting are done on copies). The mergesort is a simi- 
lar algorithm with a non-existent or trivial divide phase and 
non-trivial work and merge phases. 


Figure 1 shows a two-stage partitioning task graph structure.
Each node in the graph represents a computational task
and each edge represents a dependency between tasks. A
task a is said to be a predecessor of a task b if there is a
directed edge from a to b. Tasks without predecessors are
called initial tasks and tasks that are not the predecessors of
any task are called final tasks. A task cannot start until all its
predecessor tasks have completed execution, and once started
a task runs to completion without interruption. The level of a
task is the length of the longest path from an initial task to
that task. The execution time for the graph is the time from
the start of an initial task to the completion of all the tasks.
The number of stages in a partitioning structure is the number
of divide levels or the number of merge levels. The branching
factor is the number of successors to each divide task or
equivalently the number of predecessors to each merge task.
Many algorithms have a small constant branching factor, usually
two or three. The task graph in Fig. 1 has two stages,
five levels and a constant branching factor of two.


Figure 1: Task graph for a partitioning algorithm
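
To make these definitions concrete, the following Python sketch builds a partitioning task graph for a given number of stages and branching factor, then computes task levels and the number of initial-to-final paths. The function names and the integer task numbering are assumptions made for this illustration; levels are counted from 0 at the initial task, following the longest-path definition above.

```python
from functools import lru_cache


def partitioning_graph(stages, branching):
    """Return (edges, initial, final) for a partitioning structure."""
    edges = []
    counter = 0

    def new_task():
        nonlocal counter
        counter += 1
        return counter

    def build(stage):
        # Returns the (entry, exit) tasks of a sub-structure.
        if stage == 0:
            work = new_task()
            return work, work
        divide, merge = new_task(), new_task()
        for _ in range(branching):
            entry, exit_ = build(stage - 1)
            edges.append((divide, entry))
            edges.append((exit_, merge))
        return divide, merge

    initial, final = build(stages)
    return edges, initial, final


def task_levels(edges):
    """Level of a task = length of the longest path from an initial task."""
    preds = {}
    tasks = set()
    for a, b in edges:
        preds.setdefault(b, []).append(a)
        tasks.update((a, b))

    @lru_cache(maxsize=None)
    def level(t):
        return 0 if t not in preds else 1 + max(level(p) for p in preds[t])

    return {t: level(t) for t in tasks}


def count_paths(edges, initial, final):
    """Number of distinct directed paths from initial to final."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)

    @lru_cache(maxsize=None)
    def paths(t):
        return 1 if t == final else sum(paths(n) for n in succ.get(t, []))

    return paths(initial)
```

For two stages and a branching factor of two this yields ten tasks spread over five distinct levels and four initial-to-final paths, matching the description of Fig. 1.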


If the execution times of each of the tasks in a partitioning
structure are deterministic, the computation of the execution
time for the entire graph is trivial. However, the task
execution times in real programs are often non-deterministic
because of queueing delays due to contention for resources
such as memory or communication channels, and because of
data-dependent computation times.


Non-deterministic execution times generally result in
synchronization delays where one task has to await the completion
of other tasks. Synchronization delays and
communication costs are considered to be the most important
factors affecting the performance of parallel algorithms [2].
Our goal is to determine the effect of non-deterministic task
execution times on the total execution time of the algorithm.
We will show that, given information about the nature of task
execution times, it is possible to make accurate statements
about the mean execution time of a parallel algorithm with
the partitioning structure by drawing on results from extreme
order statistics.


It is important to be able to determine the effects on per- 
formance of synchronization delays in parallel programs for 
several reasons. First, this gives a lower bound on execution 
time that is independent of the number of processors, the 
structure of the interconnection network, and the communica- 
tion bandwidth. Also, in those cases in which a task that 
becomes ready to execute always finds an available proces- 
sor, if the interprocessor communication times are negligible 
or deterministic, they can be included as part of the task exe- 
cution times to obtain accurate execution time estimates for 
the entire program. Finally, we can compare the perfor- 
mances of algorithms based on their synchronization struc- 
tures which may in turn lead us to better parallel algorithm 
design. 


We make the following assumptions: 


1) There are enough processors, i.e., if a task is ready to 
execute it does not have to wait for a free processor. 


2) Communication costs are either negligible or are incor- 
porated into the task execution times. 


3) The execution time for each task is a random variable, 
and either the probability distribution or at least the 
mean and the variance are known. 


4) The execution times for tasks at a particular level are 
independent of each other and identically distributed 
(i.i.d.), and the execution time of a task is independent 
of the execution time of its predecessors. 


Assumptions 1 and 2 are necessary to isolate the effect 
of non-deterministic execution times on synchronization 
delays. It may not be possible to have complete information 
about the probability distribution of a task execution time, but 
the mean and variance can often be experimentally estimated 
from performance measurements when dealing with real sys- 
tems. The assumption that tasks at a level are identically dis- 
tributed is usually justified since all tasks at a particular level 
in partitioning algorithms do an identical computation albeit 
on different data. However, more often than not tasks at a 
particular level are not independent of each other. 


The rest of the paper is organized as follows. In the next 
section we review some previous work that is relevant in this 
area. We then present five methods for bounding and approx- 
imating the mean execution times for partitioning structures. 
This is followed by an evaluation of the accuracy and appli- 
cability of each of the various methods. The evaluation is 
based on distribution-driven simulations. We also present 
results for a parallel quicksort algorithm running on the Rice 
Parallel Processing Testbed (RPPT). Finally we derive an 
analytical expression for the mean execution time of a parallel
mergesort algorithm and compare its predictions with
results from the RPPT.


2. Previous Work 


Kung [3] in an early work in the area classified parallel 
algorithms as synchronous or asynchronous algorithms and 
analyzed examples of both in detail. Briggs and Dubois [4] 
analyzed the performance of synchronized iterative algo- 
rithms on three different machine architectures. Models 
based on deterministic execution times are discussed by 
Vrsalovic et al. Weide [5] used order statistics to study the 
anomalous behavior of a specific algorithm structure. 


Classification of parallel algorithms based on the struc- 
ture of task graphs has been developed by Mohan [1]. He 
used a hybrid simulation tool, PEP, that accepts distribution 
information about tasks and determines the execution time 
for a task graph. PEP can be used to model resource contention
by means of simple queuing models. Mohan studied the
partitioning structure in particular using PEP but did not give
any analytical results for the case when the task times are
non-deterministic.


Robinson [6] gave an upper bound for the mean execution
time of a general task graph under the assumption that
execution times for tasks at the same level are i.i.d. His
bound is applicable to any task graph provided the means
and variances of the task execution times at each level are
known. A well known result from order statistics (see pages
57-59 [7]) states that if i.i.d. random variables $X_1, X_2, \ldots, X_m$
have mean $\mu$ and variance $\sigma^2$ then

$$ E\left[ \max_{1 \le i \le m} X_i \right] \le \mu + \frac{m-1}{(2m-1)^{1/2}}\,\sigma \tag{1} $$

Robinson used this to derive

$$ E(T_G) \le \sum_{j=1}^{L} \left[ \mu_j + \frac{m_j-1}{(2m_j-1)^{1/2}}\,\sigma_j \right] \tag{2} $$

where $T_G$ is a random variable denoting the execution time
for the general task graph G, $m_j$ is the total number of tasks
at level j, $\mu_j$ and $\sigma_j$ are the mean and standard deviation,
respectively, of the execution time of a task at level j, and L
is the number of levels.


Eq. (2) can be interpreted as follows. In a general task 
graph, tasks at a particular level cannot start execution until 
their respective predecessors in the previous level have com- 
pleted execution. With the restriction that the tasks at a level 
start execution only after all the tasks in the previous level 
have completed execution, an upper bound on total execution 
time for the task graph can be obtained. Loosely speaking, 
Robinson’s upper bound is the mean execution time of a 
modified task graph where all tasks at a level synchronize at 
the end of execution. 
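
Robinson's bound is straightforward to evaluate numerically. The Python sketch below assumes the per-level data are supplied as (number of tasks, mean, standard deviation) triples; the function names and the example values are hypothetical and chosen only to illustrate the per-level sum.

```python
import math


def iid_max_mean_bound(mean, std, m):
    """Upper bound on the mean of the maximum of m i.i.d. variables."""
    return mean + (m - 1) / math.sqrt(2 * m - 1) * std


def robinson_bound(levels):
    """Sum of per-level bounds.

    levels -- iterable of (m_j, mu_j, sigma_j), one triple per level
    """
    return sum(iid_max_mean_bound(mu, sigma, m) for m, mu, sigma in levels)


# Hypothetical two-stage structure with branching factor two:
# 1, 2, 4, 2, 1 tasks per level, each with mean 1 and standard deviation 1.
two_stage = [(1, 1.0, 1.0), (2, 1.0, 1.0), (4, 1.0, 1.0),
             (2, 1.0, 1.0), (1, 1.0, 1.0)]
bound = robinson_bound(two_stage)
```

With zero standard deviations the bound reduces to the sum of the level means, as expected for deterministic task times; for the example above it evaluates to about 7.29.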


Eq. (2) is a strict but loose upper bound. It can be 
improved if the nature of the distribution of execution time 
for tasks at each level is known. Results for the behavior of 
extremes for some common distributions such as exponential, 
normal, and uniform are applicable under these circumstances.
We will use Robinson's bound for comparison
in evaluating the accuracy of the bounds and approximations
to be presented. We will also present and use an extension of
Robinson's approach based on an expression analogous to (1)
but dealing with dependent random variables.


3. Analysis 


We have developed five methods for bounding and 
approximating the mean execution time of a partitioning 
structure. The first uses Robinson’s approach for specific dis- 
tributions. A second bound is based on an expression analo- 
gous to (1) for dependent variables. We then provide two 
approximations for the mean execution time based on the 
number of paths from the initial task to the final task in the 
partitioning structure. Our last approximation is an iterative 
technique that takes advantage of the recursive nature of the 
partitioning structure. All the methods draw on results from 
extreme order statistics. 


3.1. Bounds for Specific Distributions 


Eq. (2) requires that only the mean and variance of task 
times are known. If information is available about the nature 
of the task time distributions, it is possible to improve upon 
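(2). If $X_1, X_2, \ldots, X_m$ are i.i.d. random variables distributed
exponentially with parameter $\lambda$, then from extreme value
theory [8]

$$ E\left[ \max_{1 \le i \le m} X_i \right] \approx \frac{1}{\lambda} \left( \log m + \gamma \right) \tag{3} $$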


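where $\gamma$ is Euler's constant (0.5772...). All log functions in
this paper are natural logarithms. If the variables are uniformly
distributed between a and b then

$$ E\left[ \max_{1 \le i \le m} X_i \right] \approx b - \frac{(b-a)}{m} \tag{4} $$

If the variables are normally distributed with parameters $\mu$
and $\sigma$ then

$$ E\left[ \max_{1 \le i \le m} X_i \right] \approx \mu + \sigma \left[ (2 \log m)^{1/2} - \frac{\log \log m + \log 4\pi}{2 (2 \log m)^{1/2}} + \frac{\gamma}{(2 \log m)^{1/2}} \right] \tag{5} $$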


Eqs. (3), (4), or (5) can be used in place of (1) to obtain 
tighter upper bounds for the mean execution time of a parti- 
tioning structure. These will not be strict bounds since (3), 
(4), and (5) are approximations that become exact for large 
values of m. However, these can be used in deriving approx- 
imate upper bounds that are asymptotically correct. 
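
The statement that such expressions become exact for large m can be checked numerically in the exponential case, where the exact mean of the maximum is a harmonic sum divided by the rate. The following Python sketch assumes that the parameter of the exponential distribution is its rate; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_max_exact(m, rate):
    """Exact mean of the maximum of m i.i.d. exponentials (harmonic sum)."""
    return sum(1.0 / i for i in range(1, m + 1)) / rate


def exp_max_approx(m, rate):
    """Large-m approximation (log m + gamma) / rate."""
    return (math.log(m) + EULER_GAMMA) / rate


for m in (2, 8, 64, 1024):
    gap = exp_max_exact(m, 1.0) - exp_max_approx(m, 1.0)
    print(f"m={m}: exact minus approximation = {gap:.5f}")
```

The gap is positive and shrinks roughly like 1/(2m), so the approximation slightly underestimates the exact mean for small m.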


This approach cannot be used in all cases since some 
distributions do not have tractable expressions for extreme 
values. An example is the beta distribution. 


3.2. Bound for Dependent Task Times 


Frequently, the assumption that the tasks at a level are 
independent is violated in real programs. For example, in a 
quicksort algorithm the execution times for the two successor 
work tasks of a divide task will be negatively correlated since 
more work for one task would result in less work (a smaller 
subarray to be sorted) for the other task. Under such circumstances
we can use an expression analogous to (1) (see
pages 78-79 [7]) that states

$$ E\left[ \max_{1 \le i \le m} X_i \right] \le \mu + (m-1)^{1/2}\,\sigma \tag{6} $$

to give

$$ E(T_G) \le \sum_{j=1}^{L} \left[ \mu_j + (m_j-1)^{1/2}\,\sigma_j \right] \tag{7} $$

This bound gives a higher estimate for the mean execution
time than would (2).
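
The relation between the two bounds can be seen by comparing the factor that multiplies the standard deviation in each case. The Python sketch below is an illustration only; the function names are assumptions introduced here.

```python
import math


def iid_coefficient(m):
    """Multiplier of sigma in the bound for m i.i.d. variables."""
    return (m - 1) / math.sqrt(2 * m - 1)


def dependent_coefficient(m):
    """Multiplier of sigma in the bound for m possibly dependent variables."""
    return math.sqrt(m - 1)


for m in (2, 4, 16, 64):
    print(m, round(iid_coefficient(m), 3), round(dependent_coefficient(m), 3))
```

The ratio of the two multipliers is the square root of (m-1)/(2m-1), which is below one for every m of at least two and approaches 0.707 for large m; the dependent-variable bound is therefore never the smaller of the two.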


3.3. Independent Paths Bound 

Both of the above methods, as well as Robinson's bound,
apply to general task graphs and do not take advantage of the
regularity of the partitioning structure. A partitioning structure
has a single initial task, a single final task, and $b^s$ different
paths from the initial to the final task, where b is the
branching factor and s is the number of stages. Under the
assumption that the execution times of tasks at a level are
i.i.d., the execution times for all paths are identically distributed
random variables with mean and variance given by

$$ \mu_{path} = \sum_{i=1}^{2s+1} E(T_i) \tag{8} $$

$$ \sigma_{path}^2 = \sum_{i=1}^{2s+1} \mathrm{variance}(T_i) \tag{9} $$

where $T_i$ is the random variable denoting the execution time
of a task at level i. If the paths are independent of each other,
the execution time of the partitioning algorithm is the maximum
of $b^s$ random variables with the given mean and variance.
Using (1) and (3) we get

$$ E(T_G) \le \mu_{path} + \frac{k-1}{(2k-1)^{1/2}}\,\sigma_{path} \tag{10} $$

where $k = b^s$. The assumption that the paths are independent
of each other is clearly false since each path shares two or
more tasks with every other path. Paths which share a large
number of tasks will have highly correlated execution times.
Nevertheless, simulation results have shown that this is an
improvement over Robinson's bound.
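
A numerical sketch of the independent paths bound follows. It assumes the per-level means and variances are given as lists of length 2s+1; the function name `independent_paths_bound` and the example values are hypothetical.

```python
import math


def independent_paths_bound(level_means, level_variances, branching, stages):
    """Path mean plus the i.i.d. max bound applied to k = b**s paths."""
    expected_levels = 2 * stages + 1
    if len(level_means) != expected_levels or len(level_variances) != expected_levels:
        raise ValueError("need one mean and one variance per level (2s+1 levels)")
    mu_path = sum(level_means)
    sigma_path = math.sqrt(sum(level_variances))
    k = branching ** stages
    return mu_path + (k - 1) / math.sqrt(2 * k - 1) * sigma_path


# Hypothetical three-stage binary structure, unit mean and variance per task.
value = independent_paths_bound([1.0] * 7, [1.0] * 7, branching=2, stages=3)
```

For this example the path mean is 7, the path standard deviation is the square root of 7, and k is 8, giving a value of about 11.78.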


3.4. Independent Paths Normal Approximation 


The independent paths bound does not make any
assumptions about the nature of the distribution of the path
execution times. If we assume that the path execution time is
normally distributed, the execution time for the entire task
graph will be the maximum of $b^s$ i.i.d. random variables
which are normally distributed. Using (5) the expected value
of the partitioning algorithm can be approximated by

$$E(T_G) \approx \mu_{path} + \sigma_{path}\left[(2\log k)^{1/2} - \frac{\log\log k + \log 4\pi}{2(2\log k)^{1/2}} + \frac{\gamma}{(2\log k)^{1/2}}\right] \qquad (11)$$

where k is the number of different paths.
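
The approximation in (11) needs only the level means and variances, so it is easy to evaluate numerically. The following is a minimal sketch under our reading of (8), (9) and (11): $\gamma$ is taken to be Euler's constant, logarithms are natural, and the function names and sample inputs are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # assumed meaning of gamma in (11)


def path_moments(level_means, level_variances):
    """Sum the per-level means and variances along one path, as in (8) and (9)."""
    return sum(level_means), sum(level_variances)


def ipn_expected_time(mu_path, var_path, k):
    """Independent paths normal approximation for k >= 2 paths."""
    if k < 2:
        raise ValueError("the approximation needs at least two paths")
    a = math.sqrt(2.0 * math.log(k))
    shift = a - (math.log(math.log(k)) + math.log(4.0 * math.pi)) / (2.0 * a)
    shift += EULER_GAMMA / a
    return mu_path + math.sqrt(var_path) * shift


if __name__ == "__main__":
    # Two-stage binary structure: 2s + 1 = 5 levels, k = 2**2 paths.
    mu, var = path_moments([1.0] * 5, [1.0] * 5)
    print(ipn_expected_time(mu, var, 2 ** 2))
```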


This approximation will be poor if the number of tasks 
along a path is small or if the execution time for the path is 
dominated by a single task. In either case the normal distri- 
bution assumption will be invalidated. Nevertheless this 
approximation is quite accurate as is shown by comparison 
with simulation results in Section 4. 


3.5. Type I Iterative Approximation 


The recursive nature of the partitioning task graph sug- 
gests an iterative solution. The execution time for an i-stage 
algorithm can be written as follows: 


$$T_i = T_{divide_i} + \mathrm{maximum}(T_{i-1}, \ldots, T_{i-1}) + T_{merge_i}$$


where $T_i$ is a random variable denoting the execution time for
an i-stage structure and $T_{divide_i}$ and $T_{merge_i}$ are the
execution times for the divide and merge tasks from the
appropriate levels. $T_0$ will be the execution time for a work
task. The number of terms in the maximum operation is the
branching factor of G. If we further assume that the three terms
in the expression are independent random variables, we can
iteratively sum the means and variances of the three variables,
provided we have information about the behavior of the maximum
of several i.i.d. random variables.


Since determining the execution time for an s-stage
algorithm using this iterative approach indirectly involves
taking the maximum of $b^s$ random variables, we will assume
that it will tend asymptotically to an extreme value
distribution. If the maximum of several i.i.d. random variables
tends to a distribution asymptotically, it has to be one of three
types of distributions, usually referred to as Type I, Type II,
or Type III extremal distributions [8]. Extreme values from
distributions with an exponential tail behavior tend to the
Type I or Gumbel distribution, those from distributions with a
polynomial tail behavior tend to the Type II distribution, and
those from bounded distributions tend to the Type III
distribution. The exponential and normal distributions are
examples of distributions whose maximum values tend to the
Type I distribution.


The Type I distribution has the properties that the maximum
of n i.i.d. variables from a Type I distribution will
remain Type I, and further the distribution of the maximum
has the same shape as the distribution of the i.i.d. random
variables but is shifted to the right. In particular

$$\mu_n = \mu + \frac{1}{\theta}\log n$$

$$\sigma_n^2 = \sigma^2$$

where $\mu$ and $\sigma^2$ are the mean and variance, respectively,
of the Type I distribution, $\mu_n$ and $\sigma_n^2$ are the mean
and variance of the extreme value distribution for n variables
and $\theta$ is a shape parameter. For the Type I distribution
the shift $(1/\theta)\log n$ equals $(\sqrt{6}/\pi)\,\sigma \log n$.
The mean is increased while the variance remains the same. Under
the assumption that the execution time for an i-stage algorithm
has a Type I distribution and that the distribution will remain
Type I even with the addition of the divide and merge time
distributions, the mean execution time can be computed
iteratively as follows:


begin
    E[T_0] = E[T_work]
    σ_0² = σ_work²
    for i = 1 until s do
    begin
        E[T_i] = E[T_divide_i] + E[T_(i-1)] + (√6 log b / π) σ_(i-1) + E[T_merge_i]
        σ_i² = σ_divide_i² + σ_(i-1)² + σ_merge_i²
    end
end


The addition of variances follows from the assumption that 
task times are independent of their predecessor tasks. 
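
The iteration is short enough to sketch directly. The code below is an illustrative reading of the procedure above, assuming the per-stage shift is $(\sqrt{6}\log b/\pi)\,\sigma_{i-1}$ with a natural logarithm; the function name and the tuple layout of the inputs are our own choices.

```python
import math


def type1_iterative(work, divide, merge, b):
    """Iterate the mean and variance of an s-stage structure.

    work   -- (mean, variance) of a work task
    divide -- list of (mean, variance) for the divide task of stages 1..s
    merge  -- list of (mean, variance) for the merge task of stages 1..s
    b      -- branching factor
    """
    mean, var = work
    shift = math.sqrt(6.0) * math.log(b) / math.pi
    for (d_mean, d_var), (m_mean, m_var) in zip(divide, merge):
        # The mean update uses the standard deviation of the previous stage.
        mean = d_mean + mean + shift * math.sqrt(var) + m_mean
        var = d_var + var + m_var
    return mean, var


if __name__ == "__main__":
    # All task types exponential with rate 1 (mean 1, variance 1), b = 2, s = 4.
    stages = [(1.0, 1.0)] * 4
    print(type1_iterative((1.0, 1.0), stages, stages, 2))
```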


The Type I Iterative method makes use of the fact that
there is a simple relation between the means and variances of
a Type I distribution and its extreme value distribution. Such
convenient relations are not available for Type II and Type
III distributions. In particular the variance of the extreme
value of a Type II distribution increases as the number of
terms in the maximum operation is increased.


4. Performance Comparison 


We evaluated the methods presented above by com- 
parison with simulation results and with Robinson’s bound. 
We first present distribution-driven simulation results for the 
exponential, uniform, and beta distributions. Results 
predicted by each of the methods are compared against simu- 
lation values to determine the accuracy of the methods. We 
then present a comparison of three of these methods with 
simulation results for a quicksort algorithm running on the 
Rice Parallel Processing Testbed (RPPT). The quicksort 
algorithm violates the independence assumptions on which 
the methods are based. Nevertheless, as will be seen, the 
methods predict the mean execution times with reasonable 
accuracy. 


4.1. Stochastic, Independent Task Execution Times 


Results from simulation and analysis were obtained for 
three different distributions, namely, the exponential, uni- 
form, and beta distributions. In each case the three task 
types, divide, work, and merge, were assumed to have the 
same type of distribution and the parameters were chosen 
such that the mean executions times would differ by an order 
of magnitude. 


Fig. 2(a) and 2(b) are graphs of the mean execution time
as a function of the number of stages for the exponential
case. Results obtained from Robinson’s bound, the independent
paths (IP) method, the independent paths normal approximation
(IPN), the Type I iterative approximation, and the
simulation are shown. The exponential distribution has a
well known extreme value behavior, and the curve labeled
R + Dist. in the graph is for results predicted by Robinson’s
bound when modified by the extreme value formula for the
exponential distribution. Fig. 2(a) gives results for the case
where the divide and merge tasks are exponentially distributed
with λ = 10 and the work task is exponential with
λ = 1. Fig. 2(b) gives results for the case where all three task
types are exponentially distributed with λ = 1.


Figure 2: Comparison of bounds and approximations for exponential distribution

Figure 3: Comparison of bounds and approximations for uniform and beta distributions


The following observations can be made from the
results. Most methods predict the mean execution time
accurately for up to a four-stage structure. Beyond four
stages, Robinson’s bound diverges rapidly from simulation
values. The IP method is better than Robinson’s bound but it
also diverges. Additional information about the distribution
improves Robinson’s bound (R + Dist.) considerably. The
independent paths normal approximation (IPN) is accurate
to within 30 percent in Fig. 2(a) and is accurate to within 5
percent in Fig. 2(b). The difference arises because the choice of
parameters in Fig. 2(a) results in a poor normal approximation
to the path execution time, since the path time is dominated
by a single task, the work task. The Type I iterative
method is the most accurate of all and is within 5 percent of
the simulation results in both figures.


Figs. 3(a) and 3(b) give similar results for the uniform 
and beta distributions, respectively. In both cases parameters 
for all task times were chosen to be identical. The Type I 
iterative method is not applicable for either distribution. No 
results are given for R+Dist. in Fig. 3(b) since formulae for 
extreme value behavior of the beta distribution are unavail- 
able. Both these graphs reinforce our earlier observations. 
The IPN method proves to be highly accurate in both cases,
predicting the simulation values to within 5 percent.


If a strict (asymptotic) upper bound is required and the 
task time distributions are known, the method of choice is 
Robinson’s method as modified by distribution information. 
However the extreme value behavior is not analytically avail- 
able for all distributions and one has to revert to Robinson’s 
original bound in these cases. The IPN method requires only 
the means and variances of task execution times and provides 
a good approximation even when the normal distribution 
assumption for path execution times is not satisfied. The 
Type I iterative is highly accurate but can be justified only for 
distributions whose extreme values converge to the Gumbel 
or Type I asymptotic distribution. 


4.2. Program Driven Task Execution Times 


Program driven results were obtained for the quicksort 
algorithm running on the Rice Parallel Processing Testbed 
(RPPT) [9]. The RPPT is a software simulation tool that 
facilitates the performance evaluation of parallel programs on 
parallel architectures. Parallel programs are analyzed for 


timing information at the assembly language level. The pro- 
gram then drives an architectural model to provide accurate 
statistics about resource usage and execution times. 


The architecture used for the RPPT simulation was a 
single bus, shared memory multiprocessor with enough pro- 
cessors so that ready tasks did not have to wait. During the 
simulation, statistics on individual task execution times were 
collected for a task at each level. Communication times, 
which were negligible, were included in the task execution 
times. 


The partitioning quicksort algorithm works as follows. 
The divide tasks partition the input array using a median ele- 
ment and start further divide or work tasks depending on the 
level. Each work task sorts its input array using quicksort. 
The merge tasks are trivial and simply terminate after inform- 
ing the next level merge tasks. The number of stages of the 
algorithm was varied from 1 to 8 and the algorithm was run 
with random integer input.

Figure 4: Quicksort of 8192 integers


Fig. 4 shows the execution times for the quicksort algo- 
rithm along with the predictions by Robinson’s method, the 
IPN method, and the iterative method. The tasks at a particu- 
lar level are not independent and tasks close together are 
highly correlated. Thus the independence assumptions on 
which all the methods are based fail. However, it is possible
to correct Robinson’s method for dependence among tasks at a
level using eq. (6). The plot in Fig. 4 reflects
this correction. Results for the iterative method are presented even
though the distribution types are not known and the indepen- 
dence assumptions are violated. Nevertheless, the iterative 
method gives results that have accuracy comparable to the 
IPN method. 


The results show that Robinson’s method when
corrected for dependencies still bounds the simulation values
from above. The IPN and iterative methods are close
together and consistently underestimate the mean execution
time. This is to be expected since the quicksort algorithm
violates the independence assumptions.


5. Analysis of a Parallel Mergesort Algorithm


The IPN approximation is a useful method for numeri- 
cally estimating program execution times. It can also be used 
as a basis for finding an analytic expression for the expected 
execution times of parallel programs with a partitioning 
structure. We now derive an expression for the mean execution
time of a mergesort algorithm using the independent
paths normal approximation method. Predictions from the
analysis are compared against simulation results from a
mergesort algorithm running on RPPT.


The mergesort algorithm is an example of a partitioning 
algorithm which has no divide tasks. Each work task sorts a
subarray of size N/k, where N is the number of elements to be
sorted, and k is the number of work tasks. Each merge task
accepts two sorted subarrays of equal size from its predeces- 
sor tasks, merges them into a single sorted array, and passes it 
to its successor. Task execution times at a particular level are 
i.i.d. since all tasks at a level do the same type of work but on 
different sections of data. 


Let the number of work tasks be k, a power of two, and
let N be the number of elements to be sorted. Assuming that
the time to mergesort an array of size N has mean $a_s N \log(N)$
and standard deviation $b_s N^{1/2}$, and the time to merge two
subarrays of size N has mean $2 a_m N$ and standard deviation
$b_m$ [10], we get:

$$\mu_{path} = a_s \frac{N}{k}\log\left(\frac{N}{k}\right) + 2 a_m N \left(\frac{k-1}{k}\right)$$

and

$$\sigma_{path}^2 = b_s^2\,\frac{N}{k} + b_m^2 \log(k)$$

where $\mu_{path}$ and $\sigma_{path}^2$ are the mean and variance
respectively of the execution time for a path from a work task to
the final merge task. Using the independent paths normal
approximation we obtain

$$E[\mathrm{mergesort}] \approx \mu_{path} + \sigma_{path}\left[(2\log k)^{1/2} - \frac{\log\log k + \log 4\pi}{2(2\log k)^{1/2}} + \frac{\gamma}{(2\log k)^{1/2}}\right]$$

Figure 5: Mergesort of 8192 integers


Fig. 5 shows the results from the IPN analysis and from
an RPPT simulation of the mergesort algorithm. The
coefficients $a_s$, $b_s$, $a_m$, and $b_m$ were estimated from
task time measurements made during the simulation. The analysis
and simulation agree to within five percent.
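
To make the mergesort analysis concrete, the sketch below accumulates the path mean and variance level by level and then applies the IPN formula. It rests on several assumptions that are ours rather than the paper's: the coefficient names `a_s`, `b_s`, `a_m`, `b_m`, base-2 logarithms for the sort cost and for the number of merge levels, natural logarithms in the extreme value term, and $\gamma$ taken as Euler's constant. The coefficient values in the example are made up.

```python
import math

EULER_GAMMA = 0.5772156649015329  # assumed meaning of gamma


def mergesort_path_moments(n, k, a_s, b_s, a_m, b_m):
    """Mean and variance of one work-to-root path; k must be a power of two."""
    levels = k.bit_length() - 1          # log2(k) merge levels
    size = n / k                         # subarray sorted by each work task
    mu = a_s * size * math.log2(size)    # assumed base-2 logarithm
    var = b_s ** 2 * size
    for _ in range(levels):
        mu += 2.0 * a_m * size           # merging two subarrays of this size
        var += b_m ** 2
        size *= 2.0
    return mu, var


def ipn_estimate(mu, var, k):
    """Independent paths normal approximation over k >= 2 paths."""
    a = math.sqrt(2.0 * math.log(k))
    shift = a - (math.log(math.log(k)) + math.log(4.0 * math.pi)) / (2.0 * a)
    return mu + math.sqrt(var) * (shift + EULER_GAMMA / a)


if __name__ == "__main__":
    mu, var = mergesort_path_moments(8192, 8, 1.0, 1.0, 1.0, 1.0)
    print(mu, var, ipn_estimate(mu, var, 8))
```

Summing the merge means in the loop gives $2 a_m N (k-1)/k$ in closed form, so the loop and a closed-form expression should agree for any power-of-two k.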


6. Conclusion 


Results from extreme value theory are applicable in 
predicting the execution times of certain parallel program 
structures. Two of the methods we have presented, which are 
based on Robinson’s approach, can be used to bound the 
mean execution time of a general task graph. The three other
methods, the independent paths approximation, the independent
paths normal approximation, and the iterative approximation,
can be used to approximate the mean execution time of a
parallel program with a partitioning structure. The IP
approximation is empirically shown to be slightly better than
Robinson’s bound and of general applicability since it does
not require complete information about the probability
distribution of task execution times. The IPN approximation gives
excellent results and is again of general applicability, as has
been shown by our analysis of a parallel mergesort algorithm.
The iterative method is very accurate, but is based on task 
time distributions whose extreme values tend to the Type I 
asymptotic distribution. However, simulation results show 
that the iterative method gives good results even when the 
distributions are unknown and independence assumptions are 
violated. 


Since bounded distributions are common in practice, an 
iterative method for the Type III asymptotic distribution 
would be of considerable interest and we are currently work- 
ing on such a method. We are also studying the problem of 
predicting the execution times of other parallel program 
structures such as the multiphase algorithms and general 
pipeline algorithms. Another area of future study in
partitioning algorithms is to consider the effects of non-negligible
communication costs and resource contention, both for
processors and communication bandwidth.
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Abstract — A new technique for the parallel execution of 
branch-and-bound algorithms using “randomization” is proposed.
The algorithm requires relatively little inter-processor communica- 
tion, while achieving good speedups over the uniprocessor execu- 
tion times; in precisely those cases where the problem size becomes 
very large, randomization is found to be extremely successful in 
achieving very good speedups. A probabilistic model has been 
devised to explore the effectiveness of this technique, and to esti- 
mate the expected speedups that could be obtained. The model 
has been validated by extensive simulation work on a multiproces- 
sor simulator. Besides being very simple to implement, the tech- 
nique also ensures high reliability, flexibility, and fault-tolerance. 


I. INTRODUCTION 


Branch and bound has often been the algorithm of choice 
for the solution of combinatorial search problems. Some of the 
classical combinatorial problems involving discrete optimization 
include the Travelling Salesman problem, the Knapsack problem, 
Job Scheduling, and Integer Linear Programming. These problems,
which occur commonly in different forms in diverse areas, belong to
the class NP-hard [4]. We propose a ‘randomized’ parallel
branch-and-bound algorithm, which, by introducing an element of 
randomness into the conventional branch-and-bound algorithm, 
makes parallelization simple, yet effective, while, at the same time, 
ensuring that interprocessor communication does not get out of 
hand. Additionally, a system employing this randomization tech- 
nique would have good fault tolerance and reliability, in the event 
of processor failures, and can be easily expanded by adding more 
processors. 


In the next section, we describe the randomized algorithm
vis-a-vis the conventional one. Section III establishes a simple
model that is used in Section IV to estimate the speedup
performance of the randomized algorithm. Section V then
presents some simulation results that were obtained from a
multiprocessor simulator. Finally, conclusions and suggestions for
future work are given in Section VI.


II. THE RANDOMIZED BRANCH-AND-BOUND ALGORITHM


The branch-and-bound algorithm has been exhaustively 
described and analyzed in the literature (See for example, [4]). A 
rigorous analysis of the branch-and-bound procedures in [8] shows 
that the expected time to solve some problems is a polynomial 
function of time. Stone and Sipala [9] present related results. 
Wah and Yu have given a stochastic model for the branch-and- 
bound algorithm in [10]. 


The branch-and-bound process may be visualized as 
searching a branch-and-bound tree, with each node representing a 
subproblem, each leaf representing either a feasible solution, or a 
sub-problem that will not yield a possible solution. Each node will 
have a value associated with it which is simply the value of the 
bounding function, g. From this point on, we will assume, without
loss of generality, that a solution with the minimum cost is being
sought. Further, we assume that the selection rule results in the 
search of the branch-and-bound tree in a depth-first fashion. Our 
method seems to be applicable to best-first searches also, as will be 
apparent, but we are yet to study its performance. 


The success of the branch-and-bound algorithm stems pri- 
marily from the fact that the bounding function, g, can be used to 
remove from consideration, fairly early in the proceedings, sub- 
problems that will not yield solutions better than the one at hand. 
As time progresses, successively better solutions will have been 
found, and with this ‘experience’ gained, the branch-and-bound 
algorithm is able to eliminate progressively larger sub-trees. Thus, 
the branch-and-bound algorithm dynamically prunes the search 
tree. The branch-and-bound algorithm just described is called
lfbb (left first branch and bound).


A parallel version of this conventional branch-and-bound 
algorithm has been considered by Wah and Ma [11]. MANIP is a 
special-purpose machine proposed by them solely for the solution 
of branch-and-bound problems. The selection rule used here is 
best-first, but depth-first search is used when secondary memory 
runs out. They have indicated that simulation studies have shown 
a speedup of k, for k processors. However, the speedup is calcu- 
lated only with respect to the number of iterations; important fac- 
tors such as (secondary) memory access times which could have a 
serious impact on the performance have been neglected. 


El-Desouki and Huen [3] have considered a scheme somewhat
similar to the one we are about to describe. Here, the workload
of each processor is deterministically apportioned at the
outset. Suppose that (as is very likely) one processor finishes
examining its portion of the solution tree before the others. It then
enters into a complicated and lengthy dialogue with each of the
other processors to determine which particular portion of the
solution tree is most suitable for exploration. Not only is there extra
communication cost involved here, but also an unknown amount
of extra computation overhead incurred by all the processors.
Another serious drawback of this scheme is apparent when
considering what would happen should a processor fail: in such a case, a
portion of the tree will remain unexamined, thereby producing
erroneous results. The method to be described avoids these
problems.


Consider the conventional branch-and-bound algorithm,
lfbb. Instead of choosing the next node to be evaluated in left-first
fashion, we will ‘‘randomize” the search by making a random 
choice for the next node from among the unexamined children. 
The cost of the best feasible solution available currently is made 
available globally to all processors. Other details of the algorithm 
remain unchanged. This algorithm is called rsbb (for random 
search branch and bound) and is given below: 


minfeas : real   { cost of the minimum feasible solution }

function rsbb (instance)
    cost ← g(instance)   { bounding function }
    if (cost < minfeas) then
        if (not feasible(instance)) then
            repeat
                nextinstance ← { randomly chosen unexamined child of instance }
                rsbb(nextinstance)
                quit ← { all children of instance have been examined }
            until (quit)
        else
            minfeas ← cost
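
For readers who want to experiment, here is a small single-processor sketch of rsbb on an explicit tree. It is only an illustration: the `Node` and `Search` classes are hypothetical, the node's stored `cost` stands in for the bounding function g, and every leaf is assumed to be a feasible solution.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    cost: float                      # value of the bounding function g at this node
    children: list = field(default_factory=list)


class Search:
    def __init__(self, rng=None):
        self.minfeas = math.inf      # cost of the best feasible solution so far
        self.examined = 0            # number of nodes examined
        self.rng = rng or random.Random()

    def rsbb(self, node):
        self.examined += 1
        if node.cost < self.minfeas:
            if node.children:        # not feasible: branch in random order
                order = list(node.children)
                self.rng.shuffle(order)
                for child in order:
                    self.rsbb(child)
            else:                    # leaf: assumed to be a feasible solution
                self.minfeas = node.cost


if __name__ == "__main__":
    root = Node(0.0, [Node(1.0, [Node(3.0), Node(5.0)]),
                      Node(2.0, [Node(2.5), Node(4.0)])])
    search = Search(random.Random(0))
    search.rsbb(root)
    print(search.minfeas, search.examined)
```

Whatever order the children are visited in, the search ends with the same minimum cost; only the number of nodes examined depends on the random choices.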


It can be shown that the expected solution time on a uni-processor
using rsbb is the same as that of lfbb for a very general class of
problems. However, the advantage of rsbb is that, under parallel
execution, the decision as to the next node to evaluate can be 
made locally, without global information. If we use k processors, 
for example, each processor will, in the main, examine a different 
portion of the tree. Of course there is a finite probability that 
replication of work will occur, but by allowing a very small 
amount of simple and terse communication between processors, 
this probability can be kept relatively small, and good speedups 
can be achieved. Another advantage of the decentralized opera- 
tion is that fault-tolerance and easy expandability are inherent in 
such a system. 


Randomization techniques have been used as effective 
heuristics to obtain sub-optimal solutions for many combinatorial 
problems [7]. Stochastic annealing is one example. The paralleli- 
zation of these kinds of searches would be an interesting research 
area. We, however, consider here search algorithms for optimal 
solutions. We use randomization not as a search heuristic, but as 
a scheduling technique. 


Janakiram et al. have described a randomized parallel
version of the backtracking algorithm in [5], where a similar
technique has been employed. Three classes of backtracking
problem-types have been identified, and for each class the expected speedup
has been determined. The expected speedup has been shown to
vary from (k+1)/2 to k (k is the number of processors),
depending on the problem-type.


In the next section we discuss the randomized
branch-and-bound algorithm in more detail, and attempt to derive a
model that is used to estimate the speedups obtainable by using
parallel rsbb.


III. MODEL DESCRIPTION


We will first establish that the expected solution time for
a branch-and-bound problem solved using lfbb is the same as when
using rsbb.


A probabilistic model for the kind of solution trees gen- 
erated by the branch-and-bound algorithm has been given by 
Smith [8]. A random branch-and-bound tree may be generated as 
follows: 


(i) Let a root node exist, which is unsprouted (level=0).
(ii) Each unsprouted node, n, at level i, is sprouted as follows:
Let n have S children, where S is a random integer whose
probability mass function (p.m.f.) is given by $p_S(t;i)$,
($p_S(t=0;i) > 0$). Each node will be assigned a cost that is the
sum of the node costs of its ancestors, together with a
random number called Q, whose distribution is given by $P_Q(t)$,
($P_Q(t=0) = 0$).
(iii) Repeat (ii) until there are no unsprouted nodes.


The above procedure is slightly more general than that 
given by Smith. The node cost obtained in this way is the simu- 
lated equivalent of the value of the bounding function g, described 
in the last section. The cost of a particular node is at least equal 
to that of its parent. The increase in node cost (viz., the r.v. Q) 
over its parent is termed the incremental node cost, and is a non- 
negative number. 
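
The generation procedure can be sketched in a few lines. The version below is a simplified illustration, not Smith's model itself: a node either has two children or none, the probability `p_no_children` of having none is the same at every level, a maximum level forces termination, the increment Q is exponential with rate `rate`, and the root is given cost 0. All of these choices and names are assumptions made for the example.

```python
import random


def sprout(rng, level, cost, max_level, p_no_children, rate):
    """Recursively sprout a node and return the subtree as nested dicts."""
    if level >= max_level or rng.random() < p_no_children:
        n_children = 0
    else:
        n_children = 2
    children = []
    for _ in range(n_children):
        increment = rng.expovariate(rate)   # incremental node cost Q >= 0
        children.append(
            sprout(rng, level + 1, cost + increment, max_level, p_no_children, rate)
        )
    return {"level": level, "cost": cost, "children": children}


def count_nodes(node):
    """Total number of nodes in the generated tree."""
    return 1 + sum(count_nodes(child) for child in node["children"])


if __name__ == "__main__":
    tree = sprout(random.Random(0), 0, 0.0, max_level=6, p_no_children=0.2, rate=1.0)
    print(count_nodes(tree))
```

The recursion sprouts nodes depth first rather than repeating step (ii) over a list of unsprouted nodes, but each node draws its children and cost increment independently of the others either way.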


Consider a random tree being searched by a processor 
using rsbb. Suppose that the left sub-tree is selected first, and it 
happens that the best solution lies in this subtree. Let I(1) be the 
number of nodes examined by rsbb in this instance. Now consider 
the case when two processors are deployed to solve the same
problem. It is possible that both processors choose the right sub-tree
first, and the two-processor system solves the problem in I(2)
steps, where I(2) ≥ I(1). This is the kind of anomalous behavior
observed by Lai and Sahni [6], for the best-first branch-and-bound
algorithm. For parallel rsbb, however, we prove the following: 


Theorem 1: Let the expected speedup obtained by parallel rsbb,
using k processors be $S_k$. Then $S_{k_1} \ge S_{k_2} \ge 1$,
($k_1 \ge k_2 \ge 1$), even if there is no inter-communication
between processors. □


Theorem 2: For all the trees generated by the above procedure, the 
probability distribution of the solution times using depth-first 
search, is independent of the order in which successive nodes are 
chosen. □


It is thus clear that lfbb and rsbb take the same time to
solve a problem, on average. The next step is to estimate the
expected solution time of lfbb. This has been determined by Smith
[8], and by Wah and Yu [12]. The latter paper has given a con- 
ceptually satisfying model of the branch-and-bound process which 
is also very accurate. The extension of this model to the multipro- 
cessor case, however, does not seem to be possible. We have, 
instead, devised a simpler model, which is adequate for our pur- 
pose, viz., to estimate the speedup on a parallel system. 


First, some assumptions are made regarding the structure 
of the branch-and-bound tree which are similar to those made by 
Wah and Yu: 


(i) We assume that the solution tree is finite and of constant
degree. As trees of degree two are the most commonly
occurring ones, we have confined ourselves to binary trees;
but this is not a restriction on the analysis.

(ii) We assume the tree is full, and of depth D.

(iii) The incremental node cost is assumed to be exponentially
distributed. This assumption is made not only because it
makes the mathematics tractable, but because this has been
the distribution that has been observed in practice. Wah
and Yu have shown that it occurs in the Knapsack problem
[12], and in integer programming [10], while Smith [8]
reports that for the Travelling Salesman problem, the
distribution is geometric, which is the discrete analogue of the
exponential distribution.


Each node of the branch-and-bound tree is identified by
the pair (i,j), where i is the level and j the serial number of the
node, beginning from the left with 0. As lfbb proceeds, it will
follow the path (0,0),(1,0),...,(D,0), where D is the depth of the
tree. Node (D,0) is a feasible solution. Now lfbb will backtrack
and look for other feasible solutions, each time updating the
variable minfeas, which contains the cost of the best possible solution
obtained thus far. When lfbb encounters a node whose cost
happens to be greater† than, or equal to, minfeas, it will simply
discard the children of this node.


Let $N_{i,j}$ be the number of nodes examined by lfbb in a
tree rooted at (i,j). We want to find $N_{0,0}$, the number of nodes
examined in the entire tree. $N_{i,j}$ can be expressed as a recurrence:

$$N_{i,j} = 1 + p_{i,j}\,(N_{i+1,2j} + N_{i+1,2j+1}); \qquad p_{0,0} = 1; \qquad N_{D,j} = 1. \qquad (1)$$

where $p_{i,j}$ is the probability that the node cost of (i,j) is less than
minfeas.


Eq. (1) states that lfbb must examine node (i,j), and then
with probability $p_{i,j}$, examine the subtrees rooted at its children.
The root (0,0) is examined with probability 1 ($p_{0,0}=1$) and a leaf
node is terminal ($N_{D,j}=1$). The computation of $p_{i,j}$ is now
described.
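
Once the probabilities $p_{i,j}$ are available, recurrence (1) can be evaluated directly. The sketch below does so for a full binary tree of depth D, taking the probabilities from a caller-supplied function `p(i, j)`; that function, and the constant values used in the example, are placeholders rather than the probabilities derived in this section.

```python
def expected_nodes(depth, p):
    """Evaluate recurrence (1) for a full binary tree of the given depth.

    p(i, j) returns the probability that node (i, j) has cost below minfeas;
    the root is always expanded (p_{0,0} = 1).
    """
    def n(i, j):
        if i == depth:
            return 1.0                       # N_{D,j} = 1
        prob = 1.0 if (i, j) == (0, 0) else p(i, j)
        return 1.0 + prob * (n(i + 1, 2 * j) + n(i + 1, 2 * j + 1))

    return n(0, 0)


if __name__ == "__main__":
    # With p = 1 everywhere no node is ever cut off: 2**(D+1) - 1 nodes.
    print(expected_nodes(3, lambda i, j: 1.0))
    # A constant hypothetical probability of 0.5 below the root.
    print(expected_nodes(3, lambda i, j: 0.5))
```

The direct recursion visits every node of the full tree, so it is only practical for small depths.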


Consider the scenario in Fig. 1. Here, lfbb has encountered . 


the | solutions in subtrees T,, T3, and 74, and is now examining 
node a, deciding whether to select the subtree rooted at a for 
searching or to backtrack. The best feasible solution, will be the 
one with minimum cost among those obtained so far. Each solu- 
tion is the sum of D (not necessarily independent) random vari- 
ables, each random variable being the incremental node cost. 


An exact analysis would proceed as follows: First, the
probability distribution of the cost of the best solution in a given
subtree must be found. This may be represented as a recurrence
relation:

$$F_{sc(d)}(t) = 1 - \left[1 - \int_0^t f_Q * f_{sc(d-1)}(z)\, dz\right]^2 \qquad (2)$$

where
$F_{sc(d)}(t)$ is the probability distribution of the best solution
cost in a subtree of depth d,
$f_Q(t)$ is the probability density of the incremental node
cost and,
* stands for convolution.


The convolution in Eq. (2) represents the probability den- 
sity function of the sum of the incremental cost of a node and the 
random variable that is the minimum cost solution in the sub-tree 
rooted at this node. The minimum solution cost in a tree is then 
the minimum of two random variables (for the right and left sub- 
trees), each of which is distributed as the sum just described. 


Next, using the tree in Fig. 1 as an example, the probabil- 
ity of the following two (in this case) events must be calculated 
(the g’s denote the incremental node costs): 


(i) {The minimum cost solution in tree T, + q';-1} <= 
{qi-1t a}. 


(ii) {Minimum cost solution in T, + q’;}S {qi}. 

Let $p_c$ be the probability of both events (i) and (ii)
occurring, i.e., the cut-off probability at node a. Then $p_{i,j}$ for a is
simply $(1 - p_c)$. Unfortunately, the analysis, if pursued as above,
quickly becomes intractable. The treatment we follow makes
certain simplifying assumptions. The results thus obtained are
approximate, but are of sufficient accuracy for the limited purpose for
which they will be used.


†Recall that we are dealing with minimization problems.


Suppose that l solutions have been found by the time node
a, at level m, is encountered (see Fig. 1). Now the cost of a
particular solution is the sum of D incremental costs. This sum may
be considered in two parts: (a) The sum of the incremental cost of 
the node at level m—1, and the cost of its parent, and (b) the sum 
of the remaining incremental costs. Although, in general, node a 
and all the solution nodes before node a is encountered may not 
have all their ancestors in common, we assume that this is indeed 
the case, and thus ignore the contribution of these ancestors to the 
variation in node costs amongst the solution nodes, and also node 
a. The contribution to the solution cost under part (b) above will 
be due to the nodes at and below level (m—1). Many of these 
ancestors are common to the l solution nodes. However, we will
assume that these ancestors are not common, i.e., the events that 
lead to the generation of the solutions are independent. The 
approximations made in computing the sums (a) and (b) are 
“opposite”, in a sense. In (a), we assume complete dependence 
when independent events exist, while in (b) we assume indepen- 
dence when dependence exists. Hence, the errors caused by these 
approximations could be expected to compensate each other, to 
some extent. 


A chain of nodes, then, from level m-1 to level D will have on it some n nodes (n = d-m+1). The contribution of these n nodes to the solution cost will be the sum of n random variables, each of which is the incremental cost of a node. This sum is termed the length of the chain. The situation is depicted in Fig. 2. There are $l$ such chains, corresponding to the $l$ solutions. By the assumptions made earlier, the length of each chain is an independent random variable. If the minimum length of these $l$ chains is less than, or equal to, the incremental cost of node a, then we conclude that there is no solution in the subtree rooted at a that is any better than the best one found so far. Further exploration of this subtree is useless, and hence node a will be cut off.


Let $p_c$ be the probability of cutoff, and $Q$ a random number that is the incremental node cost. Then,

\[ dp_c = \Pr\{\min(\text{cost of } l \text{ chains}) < t\} \times \Pr\{t \le Q \le t + dt\} = \left\{1 - \left[1 - F_{Q,n}(t)\right]^{l}\right\} f_Q(t)\,dt \tag{3} \]


$F_{Q,n}(t)$ is the probability distribution of the length of a chain. A chain is made up of n nodes, each of which contributes a random amount $Q$ to the length. $Q$ has a density function $f_Q(t)$. So,

\[ F_{Q,n}(t) = \int_0^t f_Q^{\,n}(z)\,dz \tag{4} \]

where the power is in the sense of convolution. The cut-off probability can then be written:

\[ p_c(n,l) = \int_0^\infty \left\{1 - \left[1 - \int_0^t f_Q^{\,n}(z)\,dz\right]^{l}\right\} f_Q(t)\,dt \tag{5} \]


Because of the reasons adduced earlier, we set $f_Q(t)$ to be the exponential distribution:

\[ f_Q(t) = \lambda e^{-\lambda t} \]

Then (5) becomes:

\[ p_c(n,l) = 1 - \int_0^\infty \left[\sum_{k=0}^{n} \frac{t^k}{k!}\right]^{l} e^{-(l+1)t}\,dt \tag{6} \]
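The cut-off probability defined by (3) to (5) can be checked numerically. The following is a minimal sketch, not the authors' program: it assumes a chain length equal to the sum of `stages` independent exponential node costs (so its tail is an Erlang tail), evaluates the integral in (5) with a plain trapezoidal rule, and cross-checks it against a Monte Carlo estimate. The function names, the parameter `stages`, the default rate `lam = 1` and the truncation point `t_max` are all illustrative assumptions.

```python
import math
import random


def chain_survival(t, stages, lam=1.0):
    """Pr{chain length > t} when the length is a sum of `stages`
    independent exponential(lam) node costs (Erlang tail)."""
    x = lam * t
    term, total = 1.0, 1.0
    for k in range(1, stages):
        term *= x / k          # term is now x**k / k!
        total += term
    return math.exp(-x) * total


def cutoff_probability(stages, l, lam=1.0, steps=20000, t_max=40.0):
    """Trapezoidal evaluation of the integral in (5): the probability that the
    shortest of l chains is shorter than the incremental cost Q."""
    h = t_max / lam / steps

    def integrand(t):
        return (1.0 - chain_survival(t, stages, lam) ** l) * lam * math.exp(-lam * t)

    total = 0.5 * (integrand(0.0) + integrand(steps * h))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h


def cutoff_probability_mc(stages, l, lam=1.0, trials=20000, seed=1):
    """Monte Carlo estimate of the same probability, used only as a cross-check."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q = rng.expovariate(lam)
        shortest = min(
            sum(rng.expovariate(lam) for _ in range(stages)) for _ in range(l)
        )
        if shortest < q:
            hits += 1
    return hits / trials
```

For a single-stage chain the integral has the closed form l/(l+1), which gives a quick sanity check of the numerical routine before it is used for longer chains.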


Although the integral in (6) can be obtained for small values of $l$ and n by applying the multinomial expansion, this technique is impractical for large $l$. ($l$ may take values in the order of 10°). An alternative is to use the Gauss-Laguerre quadrature. But for large $l$, this method gives rise to significant inaccuracies. An asymptotic expression may, however, be derived for large $l$, using the Laplace method [2]:


r{ 
© 7 aveg n+1 
pe(n,t) ~ 1-H fete Ae 
nt+1 a! $+1 
pst ..(7) 
where 
1 


p = [(n +1) **2 


The convergence of the infinite sum in (7) is quite rapid: when $l \sim 10n$, about four terms are required for a 1% error. For larger $l$, an even smaller number of terms is sufficient.


The quantity $p_{i,j}$ is related to $p_c(n,l)$ by:
Pig = 1 — p.(d—-t+1,724-4) 


(8) 


At this point all the elements are in place to solve the original problem given in (1), using (8), (7), and (6). The expected number of nodes examined by the branch-and-bound algorithm, for trees of various depths, can be calculated. The results are shown in Table 1, which compares the model with a simulation using 1000-2000 trials on random trees. For trees of depth 10 and below, the discrepancy between the model and the simulation is about 5%, increasing to about 14% for a depth of 12.


Suppose there are two processors working on two different subtrees, each exchanging information as to the cost of the best solution. Then the decision whether to cut off at some node will have the benefit of both these searches: twice the number of independent chains have been generated. The effect of this mutual assistance can be represented in our model by simply replacing $l$ by $kl$ in (6), where $k$ is the number of processors. The result is the speedup given that each processor has chosen a different subtree to work upon. In the next section we will derive an expression for the unconditional expected speedup for a k-processor system with global memory.


IV. SPEEDUP CALCULATIONS 


We now proceed to use the results obtained from our 
model to estimate the expected speedups. In the derivation we 
have neglected to account for the queueing and communication 
delays. In the next section, we observe that these are quite small. 


Suppose some k processors are deployed to search the tree using rsbb. At each level each processor makes a random decision regarding the next child to examine. Given a sufficiently large tree, there is but an infinitesimal probability that two processors follow the exact same path while searching the tree. Of more concern is the likelihood that a processor will take a path already trodden earlier by another processor, resulting in replication of work. To avoid this, a global list is maintained that keeps the status of the subtrees at each level. It is easy to show, by calculating the probability of the occupancy numbers, that the level to which this list must be maintained does not need to be large. In the next section it will be shown that a list maintained for the first five levels is adequate for up to 10 processors.
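The random choice of the next child can be illustrated with a single-processor sketch. This is not the authors' rsbb code: it is a hypothetical depth-first branch-and-bound for the 0/1 knapsack problem (a maximization problem, so the bound is an upper bound), in which the order of the two children is shuffled at every node. The function name `rsbb_knapsack`, the fractional bound and the `seed` parameter are assumptions made for illustration; the global list and the shared best-solution cell of the multiprocessor version are omitted.

```python
import random


def rsbb_knapsack(profits, weights, capacity, seed=0):
    """Depth-first branch-and-bound for 0/1 knapsack in which the child to
    explore first is chosen at random. Weights are assumed to be positive.
    Returns (best profit, number of nodes examined)."""
    rng = random.Random(seed)
    # Sort by profit density so the fractional bound is a simple greedy pass.
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    p = [profits[i] for i in order]
    w = [weights[i] for i in order]
    n = len(p)
    best = 0
    examined = 0

    def bound(level, profit, room):
        # Upper bound: fill the remaining room greedily, allowing one
        # fractional item at the end.
        for i in range(level, n):
            if w[i] <= room:
                room -= w[i]
                profit += p[i]
            else:
                return profit + p[i] * room / w[i]
        return profit

    def search(level, profit, room):
        nonlocal best, examined
        examined += 1
        if profit > best:
            best = profit
        if level == n or bound(level, profit, room) <= best:
            return  # cut off: nothing below can beat the incumbent
        children = [(profit, room)]                      # leave item out
        if w[level] <= room:
            children.append((profit + p[level], room - w[level]))  # take item
        rng.shuffle(children)                            # the random decision
        for child_profit, child_room in children:
            search(level + 1, child_profit, child_room)

    search(0, 0, capacity)
    return best, examined
```

Different seeds visit the tree in different orders and may examine different numbers of nodes, but every run returns the same optimal value, which is the property that lets each processor solve the problem by itself.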
We define the following terms. Let $S_k$ be the expected speedup using k processors; T(j,n) be the expected time taken by j processors to search the left subtree, which has n nodes; and T'(j,n) be the expected time to search the right subtree, after the left has been searched. The time to examine one node is taken as 1 unit. Define $\gamma = T'(j,n)/T(j,n)$. It is clear that

\[ T(j,2n) = T(j,n) + T'(j,n) + 1 \approx T(j,n) + T'(j,n) \quad \text{(for large trees)} \]

So, ignoring the error due to this approximation,

\[ \frac{T(j,2n)}{T(j,n)} = 1 + \gamma \tag{9} \]

Eq. (9) expresses the ratio of the time that j processors take to search a tree of 2n nodes to the time taken for a tree of n nodes. Evidently, this ratio is independent of j. Eq. (9) may thus be used to estimate $\gamma$ from Table 1, which gives an average value of $\gamma \approx 0.4$. Extending (9),

\[ \frac{T(j,2^{x+1})}{T(j,1)} = (1+\gamma)^{x+1} \tag{10} \]

Now, $2^{d+1} = N$ is the total number of nodes. Let $2^{x+1} = n$ be the size of the subtree searched at time t. Then,

\[ T(j,n) = T(j,N)\,(1+\gamma)^{\log_2 n - \log_2 N} \tag{11} \]

which gives:

\[ \beta(t) = \left(\frac{t}{T_N}\right)^{\alpha} \tag{12} \]

where:
$\beta(t) = n/N$ = fraction of nodes examined at time t,
$T_N$ = time to search the entire tree of N nodes, and,
$\alpha = 1/\log_2(1+\gamma)$.
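A short numerical sketch may help fix the meaning of the exponent. It assumes the relation implied by (9) to (11), namely that the fraction of nodes examined after time t is (t / T_N) raised to the power 1/log2(1 + gamma), where gamma is the ratio of right-subtree to left-subtree search time; the function names are illustrative only.

```python
import math


def alpha(gamma):
    """Exponent 1 / log2(1 + gamma) of the fraction-examined curve."""
    return 1.0 / math.log2(1.0 + gamma)


def fraction_examined(t, total_time, gamma):
    """Fraction of the N nodes examined after time t, for 0 <= t <= total_time."""
    return (t / total_time) ** alpha(gamma)
```

With gamma = 1 (no pruning benefit in the right subtree) the exponent is 1 and progress is linear in time. With the value of about 0.4 quoted from Table 1 the exponent is roughly 2.06, so halfway through the run only about a quarter of the nodes have been examined: most of the tree is disposed of late, once good bounds are available.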


Suppose there are some k processors working on one part of the tree, with $l$ processors working elsewhere in another subtree, assisting these k processors. Let $S_{k,l}$ be the expected speedup obtained in the first subtree ($S_k = S_{k,0}$). At the node rooted at this subtree, the k processors will randomly choose their next subtree to examine. Of the k, some $k_1$ will choose the left subtree and $k_2$ the right. Let $p_i$ be the probability of this partition, {$k_1, k_2$}. Without loss of generality, we can assume that $k_1 \ge k_2$. Let $\sigma_i$ be the speedup, given this partition.

\[ S_{k,l} = \sum_i p_i \sigma_i \tag{13} \]

where the sum is over all the possible partitions. Consider the partition {$k_1, k_2$}; $k_1 \ge k_2$, $k_2 \ne 0$. The $k_1$ processors, now assisted by $l + k_2$ processors, will finish searching the left subtree consisting of N/2 nodes in an expected time $t_1 = T(k_1, N/2)$.

\[ t_1 = T(k_1, N/2) = \frac{T(1,N/2)}{S_{k_1,\,l+k_2}} = \frac{1}{1+\gamma}\,\frac{T(1,N)}{S_{k_1,\,l+k_2}} \tag{14} \]

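In time $t_1$, the $k_2$ processors have finished some fraction $\beta(t_1)$ of their work. From (11),

\[ \beta(t_1) = \left[ \frac{t_1}{T(k_2, N/2)} \right]^{\alpha} \]

Now, $T(k_2, N/2)$ can be written as:

\[ T(k_2, N/2) = \frac{T(1,N/2)}{S_{k_2,\,l+k_1}} \tag{15} \]

\[ T(k_2, N/2) = \frac{T(1,N)}{1+\gamma}\,\frac{1}{S_{k_2,\,l+k_1}} \tag{16} \]

whence,

\[ \beta(t_1) = \left[ \frac{S_{k_2,\,l+k_1}}{S_{k_1,\,l+k_2}} \right]^{\alpha} \tag{17} \]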

After time $t_1$, all the processors will combine to work upon the unfinished portion of the tree. Let $t_2$ be the expected time to finish the remaining work. By properly juxtaposing the $\beta$-curves for $k_2$ and $k_1 + k_2$ processors, $t_2$ can be calculated:

\[ t_2 = T(k,N/2) - T(k,N/2)\,\beta^{1/\alpha}(t_1) = \frac{T(1,N)}{S_{k,l}\,(1+\gamma)} \left[ 1 - \beta^{1/\alpha}(t_1) \right] \tag{18} \]

The total time to search the tree, T(k,N), is $t_1 + t_2$ and is given by:

\[ T(k,N) = \frac{T(1,N)}{1+\gamma} \left[ \frac{1}{S_{k_1,\,l+k_2}} + \frac{1 - \beta^{1/\alpha}(t_1)}{S_{k,l}} \right] \]

Transposing terms, the speedup $\sigma_i$ for a given partition can then be written:

\[ \sigma_i = (1+\gamma) \left[ \frac{1}{S_{k_1,\,l+k_2}} + \frac{1 - \beta^{1/\alpha}(t_1)}{S_{k,l}} \right]^{-1} \tag{19} \]
Now, (13) can be expanded as: 
Skt = Po Sk + DP i ..(20) 


where the summation is over all partitions. 


Eq. (20) expresses a recurrence relation for the expected speedup. The calculation of the probability $p_i$ of a particular partition {$k_1, k_2$} is a straightforward matter of considering the probability of the occupancy numbers:

\[ p_i = \begin{cases} \dfrac{k!}{k_1!\,k_2!}\,\dfrac{1}{2^{k-1}}, & k_1 \ne k_2 \\[1ex] \dfrac{k!}{k_1!\,k_2!}\,\dfrac{1}{2^{k}}, & k_1 = k_2 \end{cases} \tag{21} \]

The terminating condition for the recurrence, viz., the set $S_{1,l}$, can be obtained from the model as explained in the previous section. Table 2 gives the speedup calculated in this fashion, and Fig. 3 shows a plot of these values. Setting $S_{1,l} = 1$ (all $l$) gives the case when only local versions of the best solution are kept by each processor.
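The occupancy-number calculation for the partition probabilities can be sketched as follows. The sketch assumes that each of the k processors picks the left or right subtree independently with probability 1/2, and that a partition is unordered (written with the larger part first); the function name is hypothetical.

```python
from math import comb


def partition_probabilities(k):
    """Return {(k1, k2): probability} for the unordered splits of k processors
    between two subtrees, with k1 >= k2 and k1 + k2 == k."""
    probs = {}
    for k2 in range(k // 2 + 1):
        k1 = k - k2
        ways = comb(k, k1)
        if k1 != k2:
            ways *= 2  # {k1, k2} arises as (left, right) or (right, left)
        probs[(k1, k2)] = ways / 2 ** k
    return probs
```

For example, with two processors the splits {2, 0} and {1, 1} are equally likely, so half of the time both processors start in the same subtree. The probabilities returned for any k sum to one, which is a convenient check when they are used as weights in the recurrence.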


In the next section, we will describe the simulations and 
present results obtained. 


V. SIMULATION RESULTS 


The performance of the randomized branch-and-bound 
algorithm was obtained on a multiprocessor simulator MPSIM [1], 
(implementing the PRAM-CREW model), running under ULTRIX 
on a MICROVAX. The randomized algorithm rsbb is implemented 
in C. The software can be divided into three parts: (i) The mul- 
tiprocessor simulator, (ii) the skeletal algorithm rsbb, and (iii) 
problem specific procedures and data-structures. Mutual exclusion 
is enforced by the use of a monitor. 
The problem that was chosen was the 30-element 0/1 knapsack problem [4]. The knapsack problem was chosen because of its relative ease of formulation. In order to fully explore the possible range of the problem space, we have used a suggestion of Horowitz and Sahni [4]. Six sets of 50 problems (300 in all) were used, with the problems in each set being randomly generated. The sets are described in Table 3. To determine the effect of keeping a global list of the nodes visited, the same problems were solved using three different sizes of lists: (i) the first 5 levels (31 nodes in all), (ii) the first 4 levels (15 nodes), and (iii) no global list. The results are shown in Table 4.


An important observation about the behavior of the algo- 
rithm is that large speedups occur in precisely those problem 
instances that take large times to solve. Fig. 4 shows a scatter- 
plot of the speedup obtained versus the solution time for a single 
processor, where this trend is readily discernable. 


A detailed analysis of the trace files produced by the simu- 
lator was done in order to obtain real time information. Table 5 
shows the speedups obtained as ratios of actual times. Unfor- 
tunately, the size of the trace file thus obtained grows enormously 
as the solution time or the number of processors increases, and is 
beyond the capacity of our machine. We have, hence, been able to 
obtain results only for a few of the problems, and only for up to 
five processors. A comparison with Table 4 shows that the degra- 
dation because of global memory accesses is quite small. The 
analysis for these instances showed that, on average, processors 
spent about 20% of the total time in accessing the global location 
containing the cost of the best solution; and about 8% of the time 
accessing the global list of searched subtrees (for a list size of 15 
nodes). It should be noted in this connection that for the Knap- 
sack problem relatively simple bounding functions and branching 
rules can be formulated which incur little computation costs, as 
compared to, say, the Travelling Salesman problem. The global 
memory access percentages would be even better in the latter case. 


VI. CONCLUSION 


We have proposed a randomized version of the branch- 
and-bound algorithm suitable for multi-computing. With little 
communication overhead, we are able to obtain reasonable speed- 
ups for small numbers of processors. The proposed method has 
the following advantages: 


(i) It is application independent, and transparent to the user.
(ii) Each processor is capable of solving the problem by itself. So if any processor were to fail, the other processors would still be able to perform unhindered. The overall performance would depend on these working processors. The same would apply if the communication links should fail.
(iii) The scheduler is very simple to implement, and is very flexible. There is no need to modify it when more processors are added. Changes due to modification of global memory size are trivial.


A system based on deterministic search would be hard 
pressed to provide these advantages. 


A model was devised to estimate the speedups, and 
predicts the behavior of the system to a good degree. Finally 
detailed simulation results were presented, which show the actual 
performance of the new system. 


Future research will be directed towards exploring the best-first branch-and-bound algorithm as a candidate for randomization. The present model neglects delays caused by queueing and communication times. A better model must be devised which will be able to account for these delays.
Fig. 4: Scatter-plot of Solution Times for Data-set 2.


TABLE 1: Expected number of nodes examined


TABLE 2: Expected Speedups (Predicted)


TABLE 4: Average Speedups Obtained from Simulation
(Columns: Data Set No.; Processors; Speedup for Global Memory Size; Range of Uniprocessor Solution Times.)

Notes: (i) For a description of the data sets see Table 3.
(ii) Time to examine a node is taken as 1 time unit.

TABLE 3

Data Set   $P_i$        $W_i$        Capacity
1          Random       Random       $\sum W_i / 2$
2          Random       Random       $2 \times \max\{W_i\}$
3          $W_i + 10$   Random       $\sum W_i / 2$
4          $W_i + 10$   Random       $2 \times \max\{W_i\}$
5          Random       $P_i + 10$   $\sum W_i / 2$
6          Random       $P_i + 10$   $2 \times \max\{W_i\}$

Note: Random = random integer in the range [1,100].

TABLE 5: Average Speedups Obtained from Simulation
(Columns: Speedup (Global Mem. Size); Range of Uniprocessor Soln. Times.)

Notes: (i) For a description of the data sets see Table 3.
(ii) Speedups are real time ratios.
(iii) Unit of time is 1 VAX Instruction.


Generalized Parallel Processing Models for Database Systems 


Pramanik, Sakti 


& 


Kim, Myoung Ho 


Michigan State University 
Computer Science Department, 
East Lansing, MI 48824-1027 


Abstract 


In this paper we propose a two stage abstract parallel processing 
model to facilitate systematic design of parallel processing database sys- 
tems. The objective of this model is to maximize throughput and 
minimize response time through concurrent I/O and processing of data- 
bases. 


Based on the classification of database queries whose parallel pro- 
cessing characteristics are different, we present two specific parallel pro- 
cessing models which follow the abstract model. One is the FX model 
for partial match retrieval type applications, and the other is the Multi- 
directory Hashing model where database accesses are based on primary 
proposed earlier.


This two-stage modeling approach presents a new basis on which 
parallel processing systems for various database applications can be 
easily constructed. 


1. Introduction 


Parallel processing in database systems is important because it can maximize throughput and minimize response time by increasing concurrency in query processing. However, parallel processing by itself does not necessarily lead to high performance. Some of the reasons are overhead due to interprocessor communication, remote memory accesses and data access conflicts. For parallel database operations, external I/O also causes a serious bottleneck [3].


Most past research in this area has focused on machine architectures [23,24] which are specifically designed for database work. Our objective here is to investigate database processing models for general purpose parallel processing systems.


Stone [25] showed that parallel query algorithms in a multiprocessor system may perform worse than efficient serial algorithms on a single processor system. The advantage of indexing was also emphasized in that paper. Hillyer, et al. [10] and Hawthorn, et al. [9] investigated the performance of several database machines and the results show that the performance improvement depends on the query type as well as the architecture.


We observe that a parallel processing model which is appropriate for one database application may not be appropriate for another. For example, the granularity of parallel processing which is suitable for one application may not be so for others. Thus, it may be necessary to develop different parallel processing algorithms for various types of applications. In the following paragraph we classify queries based on their parallel processing characteristics.


This research is supported in part by National Science Foundation Grant No. 


CCR-8706069, and Naval Research Laboratory Grant No. N00014-87-K-2022. 
The types of queries whose parallel processing characteristics are different are classified as follows.


(A1) Single query with multiple hits
(A2) Single query with a single hit
(A3) Single complex query
(A4) Multiple queries accessing the same relation
(A5) Multiple queries accessing different relations


Examples of queries of type A1 are partial match retrieval and range queries. Here, intra-query parallel processing is advantageous because a single query requiring many data records can be processed in parallel by multiple processors. Rosenau, et al. have also applied this type of parallel processing for the projection operation on a relation [21].


It is rather difficult to exploit parallel processing of type A2 query 
because applying parallelism for this type of query may require finer 
granularity which may result in lower throughput of the system. On the 
other hand, parallel processing may achieve the lower bound on access 
time for these types of queries when appropriate software and hardware 
architecture is used. Achieving and guaranteeing this lower bound are 
important for many real-time critical applications. Parallel processing 
models of type A2 queries can be found in [17, 18]. 


Queries of type A3 include join functions, sorting of files, and 
complex qualifications. Several database machines which use function- 
ally distributed architectures have been proposed for this type of query 
[11, 23]. 


For type A4 and A5 queries, transaction processing applications are good examples, where many independent queries can be processed in parallel. The throughput of the system for these applications can be improved by maximizing concurrency among the queries. Parallel processing models for these applications may also be developed based on the parallel processing models of type A1, A2 and A3 queries.


The remainder of this paper is organized as follows. In section 2 we propose an abstract parallel processing model for database systems. Section 3 describes optimal file distribution for A1 type queries. In section 4 we propose the Multi-directory hashing model for A4 type queries. Section 5 contains concluding remarks.


2. Two Stage Abstract Parallel Processing Model for Database Sys- 
tems 


We propose an abstract database parallel processing model as 
given in Figure 1. The basic idea is to partition data mapping into two 
stages. 


Figure 1. Abstract parallel processing model for database systems


As shown in the figure, the first stage, H1, is called the Data Distribution algorithm and the second stage, H2, is called the Data Construction algorithm. $Q_i$ represents a parallel access node. This can be a memory module or a disk, depending on the parallel processing environment.


Data distribution algorithms determine how the data is appropriately distributed to the parallel access nodes so that maximum concurrency is achieved between the access nodes. Data construction algorithms, on the other hand, determine the appropriate data structure to minimize the access time. They receive data from the data distribution algorithm and then create local access structures such as hashed or indexed files.


In general, the following strategies can be employed for data distri- 
bution : 


(B1) Declustering based on query’s data reference pattern 
(B2) Random distribution 

(B3) Objective specific declustering 

(B4) Clustering based on data reference pattern 


In method B1, the data distribution technique takes advantage of the data reference pattern of a query. For example, if a query references numerous records, the strategy may be to distribute the data so that these records are stored uniformly among the nodes. This approach may be useful for A1 type applications discussed in the previous section. In method B2, records are randomly distributed between the nodes. This method is simple, but may not guarantee a good distribution. In the objective specific method (B3), records are allocated to optimize certain objective functions. For example, [17] proposes a data distribution technique to construct multiple directories for a single relation, where a record is allocated to the node which has the smallest directory size. It has been shown in that paper that this approach gives the minimum total directory size. However, declustering of data may not always be beneficial. For example, if the interconnection network topology is based on point-to-point connection and the communication cost is large, clustering may give better performance than declustering. So, B4 type strategies may depend on the interconnection network topology.


A data distribution algorithm can be a functional mapping which depends only on data values. It maps a set of data values into a set of nodes. For example, if node addresses are determined by hashed values of input data, the distribution algorithm is a mapping which is independent of time or other system parameters. On the other hand, the distribution algorithm need not be functional. For example, random distribution may map the same record to different nodes at different times. Since data accesses are content-based for most database applications, it is advantageous to make a data distribution algorithm functional, depending only on data values.
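The difference between a functional and a non-functional distribution can be made concrete with a small sketch. The names `functional_node` and `random_node` and the use of SHA-256 are illustrative assumptions, not part of the model; any deterministic hash of the data value would serve for the functional case.

```python
import hashlib
import random


def functional_node(value, num_nodes):
    """Functional distribution: the node depends only on the data value,
    so a content-based lookup can recompute it without any directory."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def random_node(num_nodes, rng):
    """Non-functional distribution: the same record may land on a different
    node each time it is placed, so its location must be remembered."""
    return rng.randrange(num_nodes)
```

A query that specifies a value can call `functional_node` again and go straight to the node that holds the record, whereas with `random_node` every node would have to be searched or a separate location table consulted.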


Let D be a set of data and $Z_M$ = {0, 1, ..., M-1} be a set of parallel access nodes. Let the data distribution algorithm be a function from D to $Z_M$. Since actual data are quite unevenly distributed in the domain of data, data distribution algorithms are commonly designed based on the hashed values of data, which are evenly distributed in the hashed address space. Thus, we define the data distribution algorithm H1 as a composition of two functions, $H1^{(1)}$ and $H1^{(2)}$, such that $H1^{(1)}$ is a mapping from D to T and $H1^{(2)}$ is a mapping from T to $Z_M$, where T is the set of hashed values. Figure 2 shows the two level implementation of H1 for the abstract database parallel processing model. This model can be thought of as one class of the database parallel processing model described in Figure 1.

Figure 2. Two level implementation of H1 for the abstract model


Let H2 be a hash-based data construction algorithm and LD be a set of entries in all local directories generated by H2 for a given file system. If there exists a one-to-one correspondence between T and LD, T is called a real global directory. Otherwise, it is called a virtual global directory. When T is a real global directory, the set of all the local directories can be thought of as a partition of T. $H1^{(1)}$ is usually static because the use of dynamic hashing for $H1^{(1)}$ will cause significant overhead due to internode data movement. However, a static hashing scheme for $H1^{(1)}$ may result in very sparse local directories or long overflow chains. These problems can be avoided by using a virtual global directory, where the actual local directories are determined by H2.


When T is a real global directory, the ratio |T|/|D| directly affects the data retrieval time as well as storage utilization. On the other hand, we have more flexibility when T is a virtual global directory. The comparison of these two approaches will be described in more detail in sections 3.2 and 4.2. Functional distribution and real/virtual global directory concepts are used for the FX model and the Multi-directory hashing model presented in sections 3 and 4.


3. Parallel Processing Model for Partial Match Retrieval Type 
Queries 


In this section we present a parallel processing model to process partial match retrieval type queries. Partial match queries are queries where some of the attributes are specified, hence a set of qualified records needs to be retrieved. For example, q = [Age = *, Department = "mathematics", State = "Ohio"] is a partial match query, where * denotes a don't care condition.


It has been shown that multi-key hashing is effective for partial match retrieval type applications. A multi-key hash function, H, for a database consisting of n fields is a set of n functions {$H_1, \ldots, H_n$} such that given a record r = <$r_1, \ldots, r_n$>, H(r) = <$H_1(r_1), \ldots, H_n(r_n)$>. H(r) is usually called a bucket. Rothnie, et al. [22] and Rivest [20] have independently proposed the use of multi-key hashing, as an alternative to inverted files, to reduce the total search time for partial match retrieval type queries. The design of multi-key hash functions was considered in [4]. The determination of each field size for minimum search time based on query statistics was also investigated by [1,2]. In [5] it has been shown that the problem of finding the optimal field sizes for a multi-key hashing scheme is NP-hard.
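As a minimal sketch of multi-key hashing applied to a partial match query, the code below hashes each field independently and enumerates the buckets that a query with unspecified fields must visit. The per-field hash (CRC-32 reduced modulo the field size), the function names and the use of `None` for the don't care symbol * are assumptions for illustration.

```python
import zlib
from itertools import product


def field_hash(value, size):
    """Hash one field value into the range 0 .. size-1."""
    return zlib.crc32(str(value).encode("utf-8")) % size


def bucket_of(record, field_sizes):
    """Multi-key hash: one hashed value per field, together forming the bucket."""
    return tuple(field_hash(v, s) for v, s in zip(record, field_sizes))


def qualified_buckets(query, field_sizes):
    """All buckets that may hold records matching a partial match query.
    An unspecified field (None) ranges over every hashed value of that field."""
    choices = [
        range(s) if v is None else [field_hash(v, s)]
        for v, s in zip(query, field_sizes)
    ]
    return list(product(*choices))
```

With hypothetical field sizes (4, 2, 8) for Age, Department and State, the query [Age = *, Department = "mathematics", State = "Ohio"] qualifies four buckets, one for each hashed Age value, and any record matching the query hashes into one of them.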


The main objective of this section is to minimize the total number 
of bucket accesses for a partial match query by distributing buckets in 
multi-key hashing. 


3.1. Data Distribution Algorithm For Partial Match Retrieval 


The data distribution is said to be optimal for a partial match query when no device has more than ⌈total number of qualified buckets / number of devices⌉ buckets. It has been shown in [26] that there does not exist an optimal data distribution method in certain types of file systems.


There are a few heuristic methods for distributing data in partial match retrieval type queries. Du, et al. have proposed a data distribution method based on modulo allocation [6]. Modulo allocation is simple but does not work in many cases. For example, it may not give optimal distribution if some of the field sizes are less than the given number of devices. So, for a large number of parallel processing nodes, such as in Butterfly machines [27], Modulo distribution may not be appropriate. The Generalized Disk Modulo (GDM) method has also been proposed in [6] to overcome this problem. This method gives a sufficient condition to achieve optimal distribution. However, no general method has been given to find the optimal distribution parameters. In fact, the problem of finding the optimal parameter values could be very complex [6]. Several useful properties of these modulo based distribution methods have also been given in [26]. Data distribution methods based on minimal spanning trees and short spanning paths have also been proposed in [8].


In this section we propose the Fieldwise eXclusive-or (FX) distribution method, which gives better performance for a wider range of parameter values than existing methods. The basic idea of the FX distribution method is the use of the bitwise exclusive-or operation on the field values which are computed by multi-key hashing. Here, we show several useful characteristics of the exclusive-or operation for optimal data distribution. Field transformation techniques have been used to extend the scope of optimality in FX distribution.

Before describing the FX distribution method, it is necessary to introduce some notations as well as relevant definitions and assumptions for this section.


Definition :

•  $f_i$ = {0, 1, ..., $F_i$-1}, a set of hashed values of field i.
•  $F_i$ denotes $|f_i|$.
•  M denotes the number of parallel devices.
•  N is the set of all natural numbers including 0.
•  $Z_M$ is the set of all integers from 0 to M-1.
•  $(a_{m-1} \ldots a_0)_B$ is a binary notation of an integer, where $a_i$ is a binary digit.

$|f_i|$ is assumed to be a power of 2, which is common for hash directory files for partitioned [1] or dynamic hashing schemes [7, 13, 14]. The number of devices M is also assumed to be a power of 2.


Definition : Let R(q) be the set of buckets which satisfy qualifications for a partial match query q. The distribution method is called strict optimal for a partial match query in a given file system if each device has no more than ⌈|R(q)|/M⌉ buckets. When the distribution method is strict optimal for all possible partial match queries in a given file system, it is called perfect optimal for that file system.


Definition : [+] denotes the exclusive-or operation between two bits. We will use the same notation [+] to denote the exclusive-or operation between integers and sets of integers as follows. When X = $(a_{m-1} \ldots a_0)_B$ and Y = $(b_{m-1} \ldots b_0)_B$ are two integers, X [+] Y = $(a_{m-1}$ [+] $b_{m-1} \ldots a_0$ [+] $b_0)_B$. If X is an integer and Y = {$y_1, \ldots, y_r$} is a set of integers, X [+] Y is defined as {X [+] $y_i$ | $y_i \in Y$}. If both X = {$x_1, \ldots, x_k$} and Y = {$y_1, \ldots, y_r$} are sets of integers, X [+] Y is defined as {$x_i$ [+] $y_j$ | $x_i \in X$, $y_j \in Y$}.

For example, if $X_1$ = 2 and $Y_1$ = 3 then $X_1$ [+] $Y_1$ = 1. If $X_2$ = 2 and $Y_2$ = {0, 1, 2, 3} then $X_2$ [+] $Y_2$ = {0, 1, 2, 3}.
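The three cases of the definition (integer with integer, integer with set, set with set) can be captured in one small helper. The function name `xor` and the choice to return a plain integer when both arguments are integers are illustrative assumptions.

```python
def xor(x, y):
    """Exclusive-or extended to integers and sets of integers.
    Two integers give an integer; otherwise the result is the set of all
    pairwise exclusive-ors."""
    xs = x if isinstance(x, (set, frozenset)) else {x}
    ys = y if isinstance(y, (set, frozenset)) else {y}
    result = {a ^ b for a in xs for b in ys}
    if isinstance(x, int) and isinstance(y, int):
        return next(iter(result))  # a single pair, so a single value
    return result
```

The two examples from the definition come out as expected: `xor(2, 3)` is 1, and `xor(2, {0, 1, 2, 3})` is the set {0, 1, 2, 3}, because exclusive-or with a fixed integer merely permutes that set.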


Definition : $[+]_{i=1}^{n} Y_i$ = $Y_1$ [+] $Y_2$ [+] $Y_3 \cdots$ [+] $Y_n$.

Note that $[+]_{i=1}^{n}$ is a shorthand notation for performing the exclusive-or operation between sets of integers $Y_1, Y_2, \ldots, Y_n$.


Because of space limitation, the proofs of the lemmas and the 
theorems are not given. They can be found in [12, 19]. 


Compared to the abstract model of Figure 2, $H1^{(1)}$ is a multi-key hashing and FX distribution in this section corresponds to $H1^{(2)}$. In the FX distribution model, T described in Figure 2 is a set of ordered n-tuples produced by multi-key hashing.


3.1.1. Basic FX Distribution

Let $f_1 \times f_2 \times \ldots \times f_n$ be a set of all buckets. The Basic FX distribution method allocates bucket <$J_1, \ldots, J_n$> into device $T_M\bigl([+]_{j=1}^{n} J_j\bigr)$, where $T_M : N \to Z_M$ is a function which returns only the rightmost $\log_2 M$ bits of domain values, and $J_j \in f_j$ for j = 1, ..., n.


Example 1. Table 1 shows the bucket distribution by the Basic FX distribution method, where $f_1$ = {0, 1}, $f_2$ = {0, 1, 2, 3, 4, 5, 6, 7} and M = 4. In this table, binary numbers are used for field values and decimal numbers are used for Device No. (This convention will be used in all examples of FX distribution.) Here, Device No = $T_M(J_1$ [+] $J_2)$, where $J_1 \in f_1$, $J_2 \in f_2$ and $T_M$ returns the rightmost two bits of the result of $J_1$ [+] $J_2$.

$J_1$   $J_2$   Device No.
000     000     0
000     001     1
000     010     2
000     011     3
000     100     0
000     101     1
000     110     2
000     111     3
001     000     1
001     001     0
001     010     3
001     011     2
001     100     1
001     101     0
001     110     3
001     111     2

Table 1. Basic FX distribution


As shown in Table 1, Basic FX distribution is strict optimal for any partial match query in the file system of Example 1. For example, when $(001)_B$ is specified for the first field and the second field is unspecified, we have to access eight buckets <$(001)_B$, $(000)_B$>, ..., <$(001)_B$, $(111)_B$>. Since each device has two qualified buckets for this partial match query, FX distribution is strict optimal for this query.
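The device assignment of Example 1 and the strict optimality check can be reproduced with a short sketch. It assumes M is a power of 2, so that keeping the rightmost log2(M) bits is a bitwise AND with M-1; the function names and the use of `None` for an unspecified field are illustrative choices, not taken from the paper.

```python
from functools import reduce
from itertools import product
from math import ceil


def fx_device(bucket, num_devices):
    """Basic FX: exclusive-or the hashed field values, then keep the
    rightmost log2(M) bits (num_devices must be a power of 2)."""
    return reduce(lambda a, b: a ^ b, bucket, 0) & (num_devices - 1)


def is_strict_optimal(query, field_sizes, num_devices):
    """query holds an int for a specified field and None for an unspecified one.
    Strict optimal: no device holds more than ceil(|R(q)| / M) qualified buckets."""
    choices = [range(s) if v is None else [v] for v, s in zip(query, field_sizes)]
    load = [0] * num_devices
    count = 0
    for bucket in product(*choices):
        load[fx_device(bucket, num_devices)] += 1
        count += 1
    return max(load) <= ceil(count / num_devices)


def is_perfect_optimal(field_sizes, num_devices):
    """Strict optimal for every possible partial match query."""
    options = [[None] + list(range(s)) for s in field_sizes]
    return all(is_strict_optimal(q, field_sizes, num_devices)
               for q in product(*options))
```

Enumerating `fx_device((j1, j2), 4)` over the sixteen buckets of Example 1 yields the device column 0, 1, 2, 3, 0, 1, 2, 3, 1, 0, 3, 2, 1, 0, 3, 2, and `is_perfect_optimal((2, 8), 4)` holds. With 16 devices and both fields unspecified the check fails, since the exclusive-or of the two fields never exceeds 7 and only eight devices are used.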


Lemma 1.1. $Z_M$ is a set which contains M different nonnegative integers from 0 to M-1. Let k be some integer, 0 ≤ k ≤ M-1. Then $Z_M$ [+] k = $Z_M$.


Example 2. Let $Z_8$ = {0, 1, 2, 3, 4, 5, 6, 7} and k = 3. Then $Z_8$ [+] k = {3, 2, 1, 0, 7, 6, 5, 4} = $Z_8$.


Lemma 1.1 is a basic property which is used in the proofs of several theorems.
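Lemma 1.1 is easy to confirm by exhaustive check for small M. The helper below is a hypothetical illustration rather than a proof; it also shows why the assumption that M is a power of 2 matters.

```python
def xor_permutes(m):
    """True if {z ^ k : z in Z_m} equals Z_m for every k in Z_m."""
    universe = set(range(m))
    return all({z ^ k for z in universe} == universe for k in range(m))
```

The check succeeds for every power of 2 tried, and it fails for M = 6: for instance 4 [+] 2 = 6, which lies outside {0, ..., 5}.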


Theorem 1. Basic FX distribution is strict optimal for any partial 
match query in which the number of unspecified fields is O or 1. 


Theorem 2. For any partial match query which has two or more
unspecified fields, Basic FX distribution is strict optimal, if there exists at
least one unspecified field i such that $F_i \ge M$.


Note that Theorem 1 covers partial match queries with at most one
unspecified field, while Theorem 2 applies to partial match queries with
more than one unspecified field.


Theorems 1 and 2 show general characteristics of the exclusive-or
operation for optimal file distribution. This is mainly due to the property
described in Lemma 1.1. However, Basic FX distribution does
not give optimal distribution for partial match queries with 2 or more
unspecified fields, when the size of none of the unspecified fields is
greater than or equal to M. For example, when M = 16 and all others are
the same as in Example 1, the distribution is not optimal. Since every
element in $f_1$ and $f_2$ is much smaller than M, the reason it is not
optimal is clear. Theorem 3 gives sufficient conditions for optimal
distribution in these cases.


Theorem 3. Let $q(f) = \{i_1, i_2, \ldots, i_k\}$ be the set of unspecified fields
for partial match query q, where $F_j < M$ for all $j \in q(f)$. Basic FX distribution
is strict optimal for partial match query q, if there exists a set of
fields $\{i_1, \ldots, i_j\} \subseteq q(f)$ such that $|f_{i_1} \times \cdots \times f_{i_j}| \ge M$ and

$$\#\Bigl\{(V_{i_1}, \ldots, V_{i_j}) \in f_{i_1} \times \cdots \times f_{i_j} \;\Bigm|\; T_M\Bigl(\bigoplus_{p=1}^{j} V_{i_p}\Bigr) = z\Bigr\} = \frac{|f_{i_1} \times \cdots \times f_{i_j}|}{M} \quad \text{for all } z \in Z_M.$$

Theorem 3 says that we can guarantee optimal distribution, if (1)
there exists a subset of the unspecified fields whose cartesian product
has size greater than or equal to M and (2) the records projected on these
fields are distributed uniformly among the M devices.

However, when the size of none of the unspecified fields is greater
than or equal to M, the conditions given in Theorem 3 are not satisfied in
Basic FX distribution. In the next section we will introduce field
transformation techniques. These field transformation techniques
increase the scope of optimality by themselves, and also utilize Theorem
3.


The following paragraph exemplifies the idea of field transformation
techniques. Let $f_1 = \{0, 1\}$, $f_2 = \{0, 1, 2, 3, 4, 5, 6, 7\}$ and M = 16.
As we discussed, Basic FX distribution method does not give an optimal
distribution for this file system. Let X be a one-to-one mapping such
that $X(f_1) = \{0, 8\}$. When Basic FX distribution method is applied to
$X(f_1) \times f_2$, the distribution is perfect optimal. (It can be easily verified by
substituting $(1000)_2$ for $(001)_2$ in the $f_1$ column of Table 1.) Now, the
problem is to find a general one-to-one mapping, X, such that Basic FX
distribution method for $X(f_1) \times f_2$ gives optimal distribution.
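
The effect of the mapping X on this file system can be checked directly. The sketch below is illustrative: `load` is a hypothetical helper that returns the number of buckets landing on each of the M devices.

```python
from collections import Counter

M = 16
f1, f2 = [0, 1], list(range(8))
X = {0: 0, 1: 8}  # the one-to-one mapping from the text: X(f1) = {0, 8}


def load(devices):
    """Number of buckets assigned to each device 0..M-1."""
    counts = Counter(devices)
    return [counts.get(d, 0) for d in range(M)]


# Basic FX: only devices 0..7 are ever used.
basic = load((j1 ^ j2) & (M - 1) for j1 in f1 for j2 in f2)
# With X applied to the first field, every device receives one bucket.
mapped = load((X[j1] ^ j2) & (M - 1) for j1 in f1 for j2 in f2)
```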


We will present several field transformation functions such as X,
described above. Even though the techniques developed in this paper
may not achieve perfect optimal distribution in all cases, this
extended FX distribution method will give strict optimal distribution for
a large class of partial match queries.


3.1.2. FX Distribution With Field Transformation Functions 


In the previous section Basic FX distribution method was defined. 
In this section we extend Basic FX distribution method by using the field 
transformation techniques. 


Let $f_1 \times f_2 \times \cdots \times f_n$ be a set of all buckets. Extended FX distribution
method allocates bucket $\langle J_1, \ldots, J_n \rangle$, $J_j \in f_j$ for $j = 1, \ldots, n$, into
device

$$T_M\Bigl(\bigoplus_{j=1}^{n} X^{M,|f_j|}(J_j)\Bigr)$$

where

i) if $|f_j| \ge M$, $X^{M,|f_j|}$ is an identity function,
ii) if $|f_j| < M$, $X^{M,|f_j|}$ is an element of the set of injective (one-to-one)
functions whose domains are $f_j$ and ranges are $Z_M$.

$X^{M,|f_j|}$ is called a field transformation function.

When $X^{M,|f_j|}$ is the identity function for all $j = 1, \ldots, n$, Extended FX
distribution method reduces to Basic FX distribution method. From now
on, we will simply say FX distribution instead of Extended FX distribution.


It is easy to see that all lemmas and theorems that hold for Basic 
FX distribution also hold for FX distribution. Since the fields whose 
sizes are no less than the given number of devices M, do not cause any 
problem (whether it is specified or not), in this subsection we will focus 
only on the fields whose sizes are less than M. 


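Definition : Let M be a power of 2.

(1) $I : N \to N$ is an identity function.

(2) For each proper subset $f_i$ of $Z_M$, where $|f_i|$ is some power of 2,
$U^{M,|f_i|} : f_i \to Z_M$ is a function such that
$U^{M,|f_i|}(l) = l \cdot d_1^{M,|f_i|}$, where $l \in f_i$ and $d_1^{M,|f_i|} = M / |f_i|$.

(3) For each proper subset $f_i$ of $Z_M$, where $|f_i|$ is some power of 2,
$IU1^{M,|f_i|} : f_i \to Z_M$ is a function such that
$IU1^{M,|f_i|}(l) = l \oplus l \cdot d_1^{M,|f_i|}$, where $l \in f_i$ and $d_1^{M,|f_i|} = M / |f_i|$.

(4) For each proper subset $f_i$ of $Z_M$, where $|f_i|$ is some power of 2,
$IU2^{M,|f_i|} : f_i \to Z_M$ is a function such that
$IU2^{M,|f_i|}(l) = l \oplus l \cdot d_1^{M,|f_i|} \oplus l \cdot d_2^{M,|f_i|}$, where $l \in f_i$, $d_1^{M,|f_i|} = M / |f_i|$ and

$$d_2^{M,|f_i|} = \begin{cases} M / |f_i|^2 & \text{if } |f_i|^2 < M \\ 0 & \text{otherwise} \end{cases}$$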


We have defined the four groups of basic functions, $I$, $U^{M,|f_i|}$,
$IU1^{M,|f_i|}$ and $IU2^{M,|f_i|}$, which will be used in various combinations for
optimal file distribution. For example, for any values of $|f_i|$, $|f_j|$ and
M, FX distribution method distributes elements of $I(f_i) \times U^{M,|f_j|}(f_j)$
optimally.

It is not difficult to see that for any proper subset $f_i$ of $Z_M$ whose size
$|f_i|$ is some power of 2, all the functions defined above satisfy the
requirements of field transformation functions described previously.
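
The following sketch implements the transformation functions and checks the injectivity requirement numerically. It assumes the definitions read $d_1 = M/|f_i|$ and $d_2 = M/|f_i|^2$ when $|f_i|^2 < M$ (0 otherwise); the function and parameter names are illustrative, and each field is taken to be {0, ..., size-1}.

```python
def d1(m, size):
    return m // size


def d2(m, size):
    # Assumed reading of the definition of d2.
    return m // (size * size) if size * size < m else 0


def U(l, m, size):
    return l * d1(m, size)


def IU1(l, m, size):
    return l ^ (l * d1(m, size))


def IU2(l, m, size):
    return l ^ (l * d1(m, size)) ^ (l * d2(m, size))


def is_field_transformation(fn, m, size):
    """Injective on {0..size-1} with all images inside Z_m."""
    images = [fn(l, m, size) for l in range(size)]
    return len(set(images)) == size and all(0 <= v < m for v in images)


M = 16
u_values = [U(l, M, 4) for l in range(4)]
iu1_values = [IU1(l, M, 4) for l in range(4)]
iu2_values = [IU2(l, M, 2) for l in range(2)]
all_valid = all(
    is_field_transformation(fn, m, size)
    for fn in (U, IU1, IU2)
    for m in (8, 16, 64, 512)
    for size in (2, 4, 8, 16)
    if size < m
)
```

Because the multipliers are powers of 2, each function is an XOR of left-shifted copies of l, which is why distinct inputs cannot collide.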


Because of notational complexity, when the context is clear, we
will leave out the superscripts $M, |f_i|$ from transformation functions and
their parameters.


Theorem 4. When there are only two fields i, j whose sizes are less
than the given number of devices M, FX distribution with $I(f_i)$ and $U(f_j)$
is perfect optimal.


Example 3. Let $f_1 = \{0, 1, 2, 3\}$, $f_2 = \{0, 1, 2, 3\}$ and M = 16. Table 2
shows bucket distribution by FX and Modulo methods. Note that $I(f_1)$
= {0, 1, 2, 3} and $U(f_2)$ = {0, 4, 8, 12} are the I and U transformed values of
$f_1$, $f_2$ and are denoted by binary numbers in the table. Here, Device No =
$T_M(I(J_1) \oplus U(J_2))$ for FX distribution, and Device No =
$(J_1 + J_2) \bmod M$ for Modulo distribution, where $J_1 \in f_1$, $J_2 \in f_2$.


The FX distribution in Table 2 is optimal, but the Modulo distribution
is skewed. GDM method can also give optimal distribution by
multiplying the first field values by 3 and the second field values by 4.
However, these parameters must be found by trial and error.


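On the other hand, FX distribution techniques give a specific method.


I(f1)   U(f2)   Device No (FX)   Device No (Modulo)
0000    0000     0                0
0001    0000     1                1
0010    0000     2                2
0011    0000     3                3
0000    0100     4                1
0001    0100     5                2
0010    0100     6                3
0011    0100     7                4
0000    1000     8                2
0001    1000     9                3
0010    1000    10                4
0011    1000    11                5
0000    1100    12                3
0001    1100    13                4
0010    1100    14                5
0011    1100    15                6


Table 2. FX distribution with I and U transformation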


Theorem 5. When there are only two fields i, k whose $F_i$ and $F_k$ are
less than the given number of devices M, FX distribution with $I(f_i)$ and
$IU1(f_k)$ is perfect optimal.


Theorem 6. When there are only two fields j, k whose $F_j$, $F_k$ are less
than the given number of devices M, FX distribution with $U(f_j)$ and
$IU1(f_k)$ is perfect optimal.


Example 4. Let $f_1 = \{0, 1, 2, 3\}$, $f_2 = \{0, 1, 2, 3\}$ and M = 16. Table 3
shows the FX distribution with $U(f_1)$, $IU1(f_2)$. Here, Device No =
$T_M(U(J_1) \oplus IU1(J_2))$, where $J_1 \in f_1$, $J_2 \in f_2$.


Table 3. FX distribution with U and IU1 transformation


Theorem 7. When there are only two fields i and k whose $F_i$ and $F_k$
are less than the given number of devices M, FX distribution with $I(f_i)$
and $IU2(f_k)$ is perfect optimal.


Theorem 8. When there are only two fields j and k whose $F_j$ and $F_k$
are less than the given number of devices M, FX distribution with $U(f_j)$
and $IU2(f_k)$ is perfect optimal.

Lemma 9.1. When there are only three fields i, j and k whose sizes are
less than the given number of devices M, FX distribution with $I(f_i)$,
$U(f_j)$ and $IU2(f_k)$ is perfect optimal, if either

(1) there exist at least 2 fields p and q such that $p, q \in \{i, j, k\}$ and
$F_p F_q \ge M$, or

(2) $F_k \ge F_j$ and $F_k^2 < M$.

Theorem 9. Let L be the set of fields whose sizes are less than the
given number of devices M in a given file system. FX distribution with
I, U and IU2 transformation can always be perfect optimal, if $|L| \le 3$.


Example 5. Let $f_1 = \{0, 1, 2, 3\}$, $f_2 = \{0, 1\}$, $f_3 = \{0, 1\}$ and M = 16.
Table 4 shows the FX distribution with $I(f_1)$, $U(f_2)$ and $IU2(f_3)$.
Table 4. FX distribution with I, U and IU2 transformation


We have determined, through theorems, the class of partial match 
queries whose qualified buckets are distributed optimally under FX dis- 
tribution. Even though FX distribution does not always guarantee strict 
optimal distribution, FX distribution gives optimal distribution for a large 
class of partial match queries. 
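
As a sanity check on Examples 4 and 5, the device numbers can be recomputed as sketched below. This is not a reproduction of Tables 3 and 4; it assumes $d_1 = M/|f|$ and $d_2 = M/|f|^2$ when $|f|^2 < M$ (0 otherwise), and the helper names are hypothetical.

```python
from itertools import product

M = 16


def U(l, size):
    return l * (M // size)


def IU1(l, size):
    return l ^ U(l, size)


def IU2(l, size):
    d2 = M // size**2 if size**2 < M else 0  # assumed form of d2
    return l ^ U(l, size) ^ (l * d2)


def devices(*columns):
    """XOR one transformed value per field and keep the low log2(M) bits."""
    result = []
    for combo in product(*columns):
        acc = 0
        for value in combo:
            acc ^= value
        result.append(acc & (M - 1))
    return result


# Example 4: U on f1 = {0..3}, IU1 on f2 = {0..3}.
example4 = devices([U(l, 4) for l in range(4)], [IU1(l, 4) for l in range(4)])
# Example 5: I on f1 = {0..3}, U on f2 = {0, 1}, IU2 on f3 = {0, 1}.
example5 = devices(
    list(range(4)),
    [U(l, 2) for l in range(2)],
    [IU2(l, 2) for l in range(2)],
)
```

In both cases the 16 buckets land on 16 different devices, which is what perfect optimality means when the number of buckets equals M.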


3.1.3. Performance Comparisons to Other Distribution Methods 


In this section we compare FX distribution with the Modulo and GDM
methods. The performance comparisons are based on the probability of
strict optimality and response time for a given partial match query. In the
following subsections, it is assumed that the probability of being specified
is the same for all fields and that whether one field is specified is
independent of the other fields.


3.1.3.1. Probability of Strict Optimality 


In this section we show that the probability of strict optimality for 
FX distribution is higher than Modulo distribution. Even for the worst 
case the decrease of probability of strict optimality for FX distribution is 
not much. On the other hand, in Modulo distribution the decrease is 
quite large. Since no general method has been given to determine the 
existence of parameter values for strict optimal distribution in GDM 
method, we compare FX distribution to only Modulo distribution in this 
section. 


Figures 3 and 4 show the percentage of strict optimal distribution
for all possible partial match queries in a given file system consisting of
ten fields. In these figures MD denotes Modulo Distribution and FD
denotes FX Distribution. Here, results are computed from the sufficient
conditions given for each method. Figure 3 shows the case where any
two fields p and q satisfy the condition $F_p F_q \ge M$. In this figure FX
distribution used I, U and IU1 transformation methods.


[Plot: percentage of strict optimal distribution (vertical axis) against the number of fields whose sizes are less than M (horizontal axis)]


Figure 3. Probability of strict optimality


Figure 4 shows the case where, for any two fields p, q, $F_p F_q < M$
but for any three fields p, q, r, $F_p F_q F_r \ge M$. Here, in FX distribution I, U
and IU2 transformation methods are used.


[Plot: percentage of strict optimal distribution, 0 to 100 (vertical axis), against the number of fields whose sizes are less than M, 0 to 10 (horizontal axis); curves FD and MD]


Figure 4. Probability of strict optimality
3.1.3.2. Average Response Time 


Definition : For a given partial match query q, $r_i(q)$ is defined as the
number of qualified buckets in device i for the partial match query q. We
call this the response size for device i. Then, the largest response size for
a partial match query q is defined as $MAX(r_0(q), r_1(q), \ldots, r_{M-1}(q))$.
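
The definition can be made concrete with a small sketch for a file system with six fields of size 8 and M = 64. The assignment of I, U and IU1 to the fields and all helper names are assumptions made for illustration; `None` marks an unspecified field.

```python
from collections import Counter
from itertools import product

M, SIZE = 64, 8
D1 = M // SIZE


def modulo_device(bucket):
    return sum(bucket) % M


def fx_device(bucket):
    # Assumed assignment: I, U, IU1 for fields 1-3, repeated for fields 4-6.
    transforms = [lambda l: l, lambda l: l * D1, lambda l: l ^ (l * D1)] * 2
    acc = 0
    for transform, value in zip(transforms, bucket):
        acc ^= transform(value)
    return acc & (M - 1)


def largest_response_size(device_of, query):
    """MAX over devices of the number of qualified buckets r_i(q)."""
    ranges = [range(SIZE) if q is None else [q] for q in query]
    counts = Counter(device_of(b) for b in product(*ranges))
    return max(counts.values())


two_open = (None, None, 0, 0, 0, 0)
three_open = (None, None, None, 0, 0, 0)
results = {
    "modulo_2": largest_response_size(modulo_device, two_open),
    "fx_2": largest_response_size(fx_device, two_open),
    "modulo_3": largest_response_size(modulo_device, three_open),
    "fx_3": largest_response_size(fx_device, three_open),
}
```

For these two queries the ideal values are 64/64 = 1 and 512/64 = 8 buckets per device, which the FX assignment attains while the Modulo assignment concentrates buckets on the devices near the middle of the sum range.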


For the response time of a partial match query, we consider two
factors, namely, the largest response size and the CPU computation time for
bucket distribution and inverse distribution, where inverse distribution is
a procedure used to find qualified buckets. In a parallel disk environment,
the largest response size is the most important factor, while in main
memory databases CPU computation time is more important.


(1) Largest Response Size 


When systems are configured such that data retrieval time for any 
device is the same, the response time for a partial match query is 


determined by the device which has the largest number of qualified buck- 
ets. 


Tables 5 through 7 show the largest response size of Modulo, GDM
and FX distribution for some typical file system environments. The
number of fields is assumed to be 6 for all these experiments. The first
column denotes the number of unspecified fields. For GDM method we
used three different sets of multiplication parameters. These sets are
GDM1 : 2, 3, 5, 7, 11, 13 and GDM2 : 2, 5, 11, 43, 51, 57 and GDM3 :
41, 43, 47, 51, 53, 57. The FX distribution of Tables 5 and 6 used I
transformation for fields 1 and 4, U transformation for fields 2 and 5, and IU1
transformation for fields 3 and 6. The FX distribution of Table 7 used
IU2 transformation instead of IU1 transformation and is otherwise the
same as in Tables 5 and 6. In all tables, each entry is computed as the
average of the largest response sizes over all possible partial match
queries for that entry.

The tables show that, except for the first row of Tables 6 and 7, FX
distribution gives a smaller largest response size than the other methods. FX
distribution is also very close to optimal. It should also be noted that
there may be a set of multiplication parameters by which GDM method
can give better performance than those of GDM1, GDM2 and GDM3.
However, even though such a set of parameters may exist, it can only be
found by trial and error.


(2) CPU Computation Time 


In GDM method we reduce computation time by changing the Modulo
function into an AND operation. This can be done because the number of
devices is assumed to be a power of 2. In FX distribution, since the
multipliers for U, IU1 and IU2 transformation are always powers of 2, we
can replace multiplication with a shift operation. Note that we cannot do
this in GDM method because multipliers in GDM method are usually
chosen from prime or odd numbers. Function $T_M$ is done by an AND
operation.


Modulo
8.0   48.0   344.0

3.7   18.9   132.5   1031.7   8202.0
3.2   16.0   128.0   1024.0   8192.0
3.3   18.1   130.5   1026.3   8196.0
3.6   18.9   132.7   1029.7   8198.0
2460.0   18152.0

GDM1
2.1   10.2   68.3   520.5   4114.0
8.0   48.0   344.0   2460.0   18152.0
2.2   10.3   68.1   517.0   4102.0
4102.0

Table 6. M = 64, $F_1 = \ldots = F_6 = 8$

GDM1
1.7   10.0
GDM2 | GDM3
5.6   42.2   408.67   4313.0
5.1   37.3   384.0   4096.0
90.3   909.5   9176.0

Table 7. M = 512, $F_1 = \ldots = F_3 = 8$ and $F_4 = \ldots = F_6 = 16$


On the MC68000 processor, the FX method computes much faster
than the GDM method. (On the MC68000, XOR takes 8 CPU clock cycles,
ADD takes 4 clock cycles, AND takes 4 clock cycles, and an n bit shift takes
6 + 2n clock cycles, but multiplication takes about 70 clock cycles.) On
Intel 80286/80386 processors the ratios of clock cycles between different
operations are similar to those of the MC68000.


For main memory database systems FX method is more efficient 
than GDM method. The computation time for Modulo distribution is 
shorter than FX distribution. However, as shown in Table 5 through 7, 
Modulo distribution is not suitable for a large number of parallel devices. 


3.2. Data Construction Algorithms For Partial Match Queries 


Multi-key hashing for a given file with n fields produces a subset of
T, where $T = f_1 \times f_2 \times \cdots \times f_n$. As discussed in Section 2, T can be used as
either a real global directory or a virtual global directory.


3.2.1. Data Construction Using T As A Real Global Directory 


Let $GD = [0..F_1-1, \ldots, 0..F_n-1]$ be a multi-dimensional array in
which the range of the i-th dimension is $0..F_i-1$. This is the same as the range of
multi-key hashing for field i. Here, GD serves as a real global directory.
Each element of GD contains an address of a bucket. FX distribution
partitions the multi-dimensional array GD into M subsets. Note that the
directory is also distributed among the nodes to achieve maximum
concurrency. We therefore need an efficient storage rule for this
multi-dimensional array (e.g., the local address of array elements in each
device). The storage rule for the distribution of this multi-dimensional array
can be found in [19].


Using T as a real global directory is advantageous when for most
$t \in T$, $\Theta_1^{-1}(t) \ne \emptyset$, where $\Theta_1$ is a given multi-key hash function.
However, since T consists of the cartesian product of all fields, many elements in
GD may be empty. For example, when only a small fraction of GD is actually
used, the waste of storage to construct a whole directory is quite
significant.


3.2.2. Data Construction Using T As A Virtual Global Directory


Let $\langle J_1, J_2, \ldots, J_n \rangle$ be an ordered n-tuple produced by multi-key
hashing $\Theta_1$. The local hash function $\Theta_2$ uses this ordered n-tuple as
an input key for its local directory when T is used as a virtual global
directory. Note that the local directories physically exist and are
considerably smaller than the virtual global directory T, which does not exist in this
case. The local directories can dynamically grow and shrink while the
virtual global directory is static.


This mapping scheme for local data construction is quite useful
when for most $t \in T$, $\Theta_1^{-1}(t) = \emptyset$. The disadvantage of this approach is
that it may cause more probings to find qualified records. This is
because different buckets produced by $\Theta_1$ can be mapped into the same
local directory entry by $\Theta_2$.


Let V1 be a set of buckets produced by $\Theta_1$ for a given file which
are allocated into the same device. Let V2 be the range of the local hash
function $\Theta_2$. Let $w_1 = |V1|$ and $w_2 = |V2|$. Let t be the average number
of elements in V1 which are mapped into the same $v \in V2$ by $\Theta_2$. Then,
the probability P that $\Theta_2^{-1}(v) = \emptyset$ for some $v \in V2$ is given by

$$P = \left(1 - \frac{1}{w_2}\right)^{w_1}.$$

Let $c = w_2 / w_1$. Then,

$$t = \frac{1}{c\,(1 - e^{-1/c})}.$$

This can be derived by noting that $P \approx e^{-w_1/w_2} = e^{-1/c}$ and that
$t = w_1 / (w_2 (1 - P))$. t decreases with the increase of c and converges to 1.
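
The behaviour of t can be checked numerically. The sketch below assumes the formulas read $P = (1 - 1/w_2)^{w_1}$ and $t = 1/(c(1 - e^{-1/c}))$ with $c = w_2/w_1$; the function names are illustrative.

```python
import math


def chain_length_exact(w1, w2):
    """w1 keys spread over the expected number of non-empty entries."""
    p_empty = (1 - 1 / w2) ** w1
    return w1 / (w2 * (1 - p_empty))


def chain_length_approx(c):
    """Closed form using P ~ exp(-1/c), with c = w2 / w1."""
    return 1 / (c * (1 - math.exp(-1 / c)))


ratios = [0.5, 1, 2, 4, 8, 100]
approx = [chain_length_approx(c) for c in ratios]
exact_at_c1 = chain_length_exact(1000, 1000)
```

At c = 1 the approximation gives about 1.58 probes per non-empty entry, and the value falls toward 1 as the local directory grows relative to the number of buckets.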


Achieving efficient storage utilization, e.g., c = 1, causes more
probings to find a bucket. To reduce the number of probings, a larger local
hashed directory is needed. Compared to the multi-dimensional array
view of a real global directory, if the empty portion of GD is large, the
mapping scheme of a virtual global directory can achieve data retrieval
as fast as a real global directory scheme, while requiring less
memory than a real global directory. Otherwise, both methods involve a
time/space trade-off.


4. Multi-Directory Hashing for Key Access Applications 


In this section we present another application of the abstract paral- 
lel processing model. This application is for random access file system 
and is based on multi-directory hashing scheme [PrCh87]. Multi- 
directory hashing is a class of hashing schemes which use multiple direc- 
tories to maximize concurrent accesses to a single relation. Each one of 
the directories can be of different size and it grows and shrinks dynami- 
cally. 


It will be shown that multi-directory hashing provides better
performance than single directory hashing schemes. The performance
difference between single directory and multi-directory hashing schemes
becomes significant in main memory databases. This is because a short
overflow chain (e.g., one record) is needed for main memory databases
to reduce the main memory processing cost (e.g., comparing key values).
In the following section we present one possible method for implementing
multi-directory hashing.


4.1. Construction of A Multi-Directory Hashing 
The algorithm for a multi-directory hashing is described below. 


(1) The hashed address is partitioned into two parts. One part is used
for the directory number and the other part is used for locating the
record within a local directory.

(2) Each directory of a multi-directory hashing is created based on a
hashing method.
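
The two steps above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: the class name is hypothetical, taking the low-order bits as the directory number is an assumption, and a Python dict stands in for the extendible or linear hashing directory used within each local directory.

```python
class MultiDirectoryHash:
    def __init__(self, directory_bits):
        self.directory_bits = directory_bits
        # One independent local directory per directory number.
        self.directories = [dict() for _ in range(1 << directory_bits)]

    def _split(self, key):
        """Partition the hashed address into (directory number, local part)."""
        address = hash(key) & 0xFFFFFFFF
        directory_no = address & ((1 << self.directory_bits) - 1)
        local_part = address >> self.directory_bits
        return directory_no, local_part

    def insert(self, key, record):
        directory_no, local_part = self._split(key)
        chain = self.directories[directory_no].setdefault(local_part, [])
        chain.append((key, record))

    def lookup(self, key):
        directory_no, local_part = self._split(key)
        for stored_key, record in self.directories[directory_no].get(local_part, []):
            if stored_key == key:
                return record
        return None


mdh = MultiDirectoryHash(directory_bits=3)
for key in range(5000):
    mdh.insert(key, f"record-{key}")
```

Since each request touches exactly one local directory, requests whose keys map to different directory numbers could be served concurrently.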


Figure 5 shows the address mapping scheme of the proposed
multi-directory hashing. This figure also shows the relationship between the
functions of this model and those of the abstract model in Figure 2.


[Diagram: the hash address, with its parts labeled as input for H2, input for H1, and directory #]


Figure 5. Hash address Mapping in Multi-Directory Hashing


In the next section the performance improvement of a multi- 
directory hashing over single directory hashing is described. 


4.2. Performance Comparison for Multi-directory Hashing 


Here, we show the reduction in main memory requirement while 
achieving near optimal response time (i.e., one access for a record and 
one key comparison). Extendible and linear hashing schemes are used 
for constructing the directories. 


In the following figures the directory size for a multi-directory
hashing corresponds to the total size of all the directories. Figure 6
gives the directory sizes for a multi-directory hashing when 5000 unique
key values are inserted into the file. In this figure the total directory size
decreases considerably with increasing number of directories.


Figure 7 shows the cases of various file sizes. We see that the 
reduction in directory sizes is significant for larger files when the number 
of directories increases. 


The lower bound on response time can be achieved in multi-directory
hashing at a much lower main memory requirement. In addition,
the throughput increases considerably through concurrent processing
of the multi-directory hashing. Here, we process data requests in
parallel by accessing multiple directories concurrently.


[Plot: total directory size (log 2) against number of directories (log 2), for extendible hashing and linear hashing]


Figure 6. Total directory size in multi-directory hashing
5. Conclusion 


In this paper we propose an abstract parallel processing model for
database systems which consists of a data distribution stage and a data
construction stage. This two-stage model helps in systematically developing
an efficient parallel processing database system.


We classify database queries whose parallel processing charac- 
teristics are different. Parallel processing models for two of these 
classes are then presented. Maximizing concurrency and minimizing 
response time are the most important objectives for these two models. 


[Plot: total directory size (log 2) for various file sizes]


Figure 7. Multi-directory hashing for various file sizes


First, we propose the FX model for partial match retrieval type
queries. In the distribution stage of the FX model, several characteristics of
the exclusive-or operation are exploited to achieve optimal file distribution.
The optimality conditions are derived through lemmas and theorems.
We compare the FX distribution method with others and show the
performance improvement of our methods. In the construction stage of the FX
model, two data construction methods, based on the real and virtual global
directory, are presented. Here, performance trade-offs for these
methods are investigated.


Second, we propose the multi-directory hashing scheme which is 
suitable for concurrent accesses to a single file. Our focus here is main 
memory database accesses on primary keys. We show that the proposed 
multi-directory hashing scheme gives improved performance over single 
directory hashing. 
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Abstract 


Efficient algorithms for image template matching on fine
grained SIMD hypercube multicomputers are developed. Our algorithms
are asymptotically faster than previously known algorithms
for this problem.
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1. INTRODUCTION 


The inputs to the image template matching problem are an
$N \times N$ image matrix $I[0..N-1, 0..N-1]$ and an $M \times M$ template
$T[0..M-1, 0..M-1]$. The output is an $N \times N$ matrix C2D where

$$C2D[i, j] = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} I[(i+u) \bmod N, (j+v) \bmod N] * T[u, v]$$

C2D is called the two dimensional convolution of I and T. Template
matching, i.e., computing C2D, is a fundamental operation
in computer vision and image processing. It is often used for edge
and object detection, filtering, and image registration [BALL85]
[ROSE82]. Because of the fundamental nature of this problem and
because of its high complexity ($O(M^2 N^2)$ on a single processor
computer), much attention has been devoted to the development of
efficient fine grain multicomputer parallel algorithms. For example,
Chang, Ibarra, Pong and Sohn [CHAN87] have studied this
problem on SIMD pyramid computers; Ranka and Sahni
[RANK87a], Maresca and Li [MARE86] and Lee and Agarwal
[LEE87] have considered mesh connected computers; Fang, Li
and Ni [FANG85], Fang and Ni [FANG86], and PrassanaKumar
and Krishnan [PRAS86] have considered SIMD hypercube
multicomputers; and Ranka and Sahni [RANK88] have considered
MIMD hypercube multicomputers.
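
For reference, the definition of C2D corresponds to the following single processor computation, which performs $M^2$ multiply-adds for each of the $N^2$ outputs. This is only a direct sketch of the formula, assuming the summation limits run from 0 to M-1; the function name is illustrative and it is not one of the parallel algorithms developed in the paper.

```python
def c2d(image, template):
    """Two dimensional convolution with wrap-around indexing."""
    n, m = len(image), len(template)
    return [
        [
            sum(
                image[(i + u) % n][(j + v) % n] * template[u][v]
                for u in range(m)
                for v in range(m)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
diagonal = c2d(image, [[1, 0], [0, 1]])  # I[i][j] + I[i+1][j+1], wrapped
```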


In this paper, we restrict our attention to SIMD hypercube
multicomputers. We develop three asymptotically optimal algorithms
that require $N^2$ processors. These require O(M), O(log M)
and O(1) memory per processor, respectively. The O(M) memory
algorithm is faster than the algorithms for O(log M) and O(1)
memory by a constant factor. While the O(log M) and O(1)
memory algorithms are of comparable complexity, the former is
conceptually simpler. [PRAS87] considers only the cases of O(M)
and O(1) memory. The algorithms developed in [PRAS87] require
a broadcast capability in the SIMD hypercube. Our algorithms do
not require this. While the algorithm in [PRAS87] for the case of
O(M) memory is optimal, ours uses fewer interprocessor routes and
so is faster even though no broadcasting is used. The algorithm of
[PRAS87] for O(1) memory is suboptimal by an O(log M) factor.
Our algorithm runs in the asymptotically optimal time of $O(M^2)$.
Using the techniques of [PRAS87], each of our three algorithms
may be generalized to obtain asymptotically optimal algorithms
for SIMD hypercube computers with $N^2 K^2$, $1 \le K \le M$, processors
and O(M/K), O(log(M/K)), and O(1) memory per processor
respectively.


Section 2 describes our hypercube model. In addition, notation
and some fundamental data movement operations are
developed in this section. In Section 3, we develop fine grained algorithms
for one dimensional convolution. These form a basic component
of our two dimensional convolution algorithms which are
developed in Section 4.


2. PRELIMINARIES 


2.1. Hypercube Multicomputer 


The important features of an SIMD hypercube and the pro- 
gramming notation we use are: 


1. There are $P = 2^p$ processing elements connected together via
a hypercube interconnection network (to be described later).
Each PE has a unique index in the range $[0, 2^p - 1]$. We
shall use brackets ('[ ]') to index an array and parentheses ('( )')
to index PEs. Thus A[i] refers to the i'th element of array A
and A(i) refers to the A register of PE i. Also, A[j](i) refers
to the j'th element of array A in PE i. The local memory in
each PE holds data only (i.e., no executable instructions).
Hence PEs need to be able to perform only the basic arithmetic
operations (i.e., no instruction fetch or decode is
needed).


2. There is a separate program memory and control unit. The
control unit performs instruction sequencing, fetching, and
decoding. In addition, instructions and masks are broadcast
by the control unit to the PEs for execution. An
instruction mask is a boolean function used to select certain
PEs to execute an instruction. For example, in the instruction

$A(i) := A(i) + 1, \quad (i_0 = 1)$

$(i_0 = 1)$ is a mask that selects only those PEs whose index has
bit 0 equal to 1. I.e., odd indexed PEs increment their A
registers by 1. Sometimes, we shall omit the PE indexing of
registers. So, the above statement is equivalent to the statement:

$A := A + 1, \quad (i_0 = 1)$


3. A p dimensional hypercube network connects $2^p$ PEs. Let
$i_{p-1} i_{p-2} \ldots i_0$ be the binary representation of the PE index i.
Let $\bar{i}_k$ be the complement of bit $i_k$. A hypercube network
directly connects pairs of processors whose indices differ in
exactly one bit. I.e., processor $i_{p-1} i_{p-2} \ldots i_0$ is connected to
processors $i_{p-1} \ldots \bar{i}_k \ldots i_0$, $0 \le k \le p-1$. We use the notation $i^{(b)}$ to
represent the number that differs from i in exactly bit b.

4. Interprocessor assignments are denoted using the symbol $\leftarrow$,
while intraprocessor assignments are denoted using the symbol
:=. Thus the assignment statement:

$B(i^{(2)}) \leftarrow B(i), \quad (i_2 = 0)$

is executed only by the processors with bit 2 equal to 0.
These processors transmit their B register data to the
corresponding processors with bit 2 equal to 1.

5. In a unit route, data may be transmitted from one processor
to another if it is directly connected. We assume that the
links in the interconnection network are unidirectional. Hence
at any given time, data can be transferred either from PE i
$(i_b = 0)$ to PE $i^{(b)}$ or from PE i $(i_b = 1)$ to PE $i^{(b)}$. Hence the
instruction:

$B(i^{(2)}) \leftarrow B(i), \quad (i_2 = 0)$

takes one unit route, while the instruction:

$B(i^{(2)}) \leftarrow B(i)$

takes two unit routes.


All logarithms are assumed to have base 2.


6. Since the asymptotic complexity of all our algorithms is 
determined by the number of unit routes, our complexity 
analysis will count only these. 
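
The notation in the list above can be simulated on a single machine to see what the masked and unmasked interprocessor assignments do. The sketch below models the B registers as a Python list indexed by PE number; the function names are hypothetical and the simulation does not model timing, only the resulting register contents.

```python
def neighbor(i, b):
    """i^(b): the index that differs from i in exactly bit b."""
    return i ^ (1 << b)


def masked_route(B, b, bit_value):
    """B(i^(b)) <- B(i), (i_b = bit_value): one unit route."""
    new_B = list(B)
    for i in range(len(B)):
        if (i >> b) & 1 == bit_value:
            new_B[neighbor(i, b)] = B[i]
    return new_B


def exchange(B, b):
    """B(i^(b)) <- B(i) for every PE: two unit routes on unidirectional links."""
    return [B[neighbor(i, b)] for i in range(len(B))]


B = list(range(8))  # PE i initially holds the value i
one_way = masked_route(B, 2, 0)
swapped = exchange(B, 2)
```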


2.2. Hypercube Embedding of a Grid


Figure 1 gives a two dimensional grid interpretation of a 4
dimensional hypercube. The index of the PE at position (i, j) of
the grid is obtained using the standard row major mapping of a
two dimensional array onto a one dimensional array [HORO85].
I.e., for an $N \times N$ grid, the PE at position (i, j) has index $iN + j$.
Using this mapping, a two dimensional image grid
$I(0..N-1, 0..N-1)$ is easily mapped onto an $N^2$ hypercube (provided
N is a power of 2) with one element of I per PE. Notice that
in this mapping, image elements that are neighbors in I (i.e., to the
north, south, east, or west of one another) may not be neighbors
(i.e., may not be directly connected) in the hypercube. This does
not lead to any difficulties in the SIMD algorithms we develop.


Figure 1: Embedding of a 4 x 4 mesh in a
hypercube of dimension 4


2.3. Basic Data Manipulation Operations 


2.3.1. Data Circulation 


Consider a $P = 2^p$ processor hypercube. We are required to
circulate the data in the A register of these PEs so that this data
visits each of the P processors exactly once. A near optimal circulation
for SIMD hypercubes results from the use of the exchange
sequence $X_q$ [DEKE81] defined as

$X_1 = 0, \quad X_q = X_{q-1},\ q-1,\ X_{q-1} \quad (q > 1)$

This sequence essentially treats a q dimensional hypercube as two
q-1 dimensional hypercubes. Data circulation is done in each of
these in parallel using $X_{q-1}$. Next an exchange is done along bit
q-1. This causes the data in the two halves to be swapped. The
swapped data is again circulated in the two half hypercubes using
$X_{q-1}$. Let $f(q, i)$ be the i'th number (left to right) in the sequence
$X_q$, $1 \le i < 2^q$. The resulting SIMD data circulation algorithm is
given in Figure 2.


procedure CIRCULATE(A);
{data circulation}
for i := 1 to P-1 do
    $A(j^{(f(p,i))}) \leftarrow A(j)$;
end


Figure 2: Data circulation in an SIMD hypercube


Because of our assumption of unidirectional
links, each iteration of the for loop of Figure 2 takes 2 unit routes.
Hence Figure 2 takes 2(P-1) unit routes. The function f can be
computed by the control processor in O(P) time and saved in an
array of size P-1 (actually it is convenient to compute f on the fly
using a stack of height log P). The following Lemma allows each
processor to compute the origin of the current A value.


Lemma 1: Let $A_0, A_1, \ldots, A_{2^p-1}$ be the values in
$A(0), A(1), \ldots, A(2^p-1)$ initially. Let index(j, i) be such that
A[index(j, i)] is in A(j) following the i'th iteration of the for
loop of Figure 2. Initially, index(j, 0) = j. For every
i, i > 0, $index(j, i) = index(j, i-1) \oplus 2^{f(p,i)}$ ($\oplus$ is the exclusive
or operator).

Proof: See [RANK87b].
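
The exchange sequence and Lemma 1 can be checked with a small simulation. The sketch below models the A registers as a list indexed by PE number and applies one full exchange per element of $X_p$; the function names are illustrative.

```python
def exchange_sequence(q):
    """X_1 = 0, X_q = X_{q-1}, q-1, X_{q-1}."""
    if q == 1:
        return [0]
    previous = exchange_sequence(q - 1)
    return previous + [q - 1] + previous


def circulate(A, p):
    """Yield the register contents after each iteration of the for loop."""
    A = list(A)
    for b in exchange_sequence(p):
        # A(j^(b)) <- A(j) for every PE j.
        A = [A[j ^ (1 << b)] for j in range(len(A))]
        yield list(A)


p = 3
P = 1 << p
seen = [[j] for j in range(P)]  # PE j starts with A_j, so index(j, 0) = j
for state in circulate(range(P), p):
    for j in range(P):
        seen[j].append(state[j])

# Lemma 1: index(j, i) = index(j, i-1) XOR 2^f(p, i).
sequence = exchange_sequence(p)
lemma_holds = all(
    seen[j][i] == seen[j][i - 1] ^ (1 << sequence[i - 1])
    for j in range(P)
    for i in range(1, P)
)
```

Each list `seen[j]` records the origins of the values that pass through PE j; a full circulation means every PE sees all P values exactly once.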


Some of our algorithms will require the circulating data to 
return to the originating PEs. This can be accomplished by a final 
data exchange along the most significant bit. For convenience, we 
define f(p, 2^p) = p-1.
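
The exchange sequence and the circulation loop can be sketched in Python as follows. This is a minimal simulation under stated assumptions, not the authors' implementation: the names exchange_sequence and circulate are hypothetical, the A registers are modelled as a list indexed by PE number, and each iteration is treated as a full exchange along one hypercube bit. It can be used to check that every PE sees all P values and that the origin of the current value follows the exclusive or rule of Lemma 1.

```python
def exchange_sequence(q):
    """Return X_q as a list: X_1 = [0], X_q = X_{q-1}, q-1, X_{q-1}."""
    if q == 1:
        return [0]
    prev = exchange_sequence(q - 1)
    return prev + [q - 1] + prev


def circulate(p):
    """Simulate the circulation loop on a 2^p PE hypercube.

    Returns history, where history[j][i] is the origin (initial PE index)
    of the value held in A(j) after iteration i; history[j][0] == j.
    """
    P = 1 << p
    f = exchange_sequence(p)              # f[i - 1] stands for f(p, i)
    a = list(range(P))                    # A(j) initially holds value A_j
    history = [[j] for j in range(P)]
    for i in range(1, P):
        bit = 1 << f[i - 1]
        a = [a[j ^ bit] for j in range(P)]    # exchange along bit f(p, i)
        for j in range(P):
            history[j].append(a[j])
    return history
```

For example, exchange_sequence(3) returns [0, 1, 0, 2, 0, 1, 0], and each row of circulate(3) is a permutation of 0..7.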


2.3.2. Data Broadcast 


In a data broadcast, data originates at one PE and is to be 
transmitted to the remaining P—1 PEs. This can be done using 
logP unit routes [DEKE81]. 


2.3.3. Window Broadcast 


Assume that W is a power of 2 and that a P processor
hypercube is tiled by windows of size 1 x W such that each window
forms a subhypercube with W PEs. In a window broadcast,
data originates in one of these windows (different data in different
PEs of the window). The data in this window is to be copied to the
remaining (P/W) - 1 windows. This copying can be done using
log(P/W) unit routes [DEKE81].


2.3.4. Data Sum 


Assume the window tiling of Section 2.3.3. The data in each
of the windows is to be summed and the sum left in a prespecified
PE (same relative PE for each window). For example, if we are
summing the A register data, we may be required to compute:

    Sum(iW) = \sum_{j=0}^{W-1} A(iW + j),  0 <= i < (P/W)

Here, the sum is left in the first PE of each window. Data sum can
be done in log W unit routes [DEKE81].
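
One way to reach the log W bound is recursive halving along the log W low order bits of the PE index. The Python sketch below is an illustration only: the name window_sum and the list model of the A registers are assumptions, and the schedule is a standard reduction rather than the specific algorithm of [DEKE81]. In every step only PEs with the current bit set transmit, so each step costs one unit route under the unidirectional link assumption.

```python
def window_sum(a, W):
    """Sum each window of W consecutive PEs into the window's first PE.

    a[j] models the A register of PE j; W and len(a) are powers of 2.
    Returns (register contents afterwards, number of unit routes used).
    """
    a = list(a)
    routes = 0
    bit = 1
    while bit < W:
        for j in range(len(a)):
            # Active senders: this bit is 1 and all lower bits are 0.
            if (j & bit) and (j & (bit - 1)) == 0:
                a[j ^ bit] += a[j]
        routes += 1
        bit <<= 1
    return a, routes
```

Only the entries at indices 0, W, 2W, ... of the returned list are meaningful sums; the other entries hold intermediate values.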


2.3.5. Shift 


SHIFT(A, i, W) shifts the A register data circularly counterclockwise
by i in windows of size W. I.e., A(qW + j) is replaced by
A(qW + (j-i) mod W), 0 <= q < (P/W). SHIFT(A, i, W) on an
SIMD computer can be performed in 2 log W unit routes [PRAS87].
A minor modification of the algorithm given in [PRAS87] performs
i = 2^m shifts in 2 log(W/i) unit routes ([RANK87b]).
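
The effect of SHIFT can be pinned down with a small reference model. The Python sketch below only captures the input/output behaviour stated above and the quoted route count for power of 2 distances; it does not model the routing steps of [PRAS87]. The names shift and pow2_shift_routes are hypothetical.

```python
def shift(a, i, W):
    """Reference semantics of SHIFT(A, i, W): within each window of W PEs,
    A(qW + j) is replaced by A(qW + (j - i) mod W)."""
    out = list(a)
    for q in range(len(a) // W):
        for j in range(W):
            out[q * W + j] = a[q * W + (j - i) % W]
    return out


def pow2_shift_routes(i, W):
    """Quoted cost 2*log2(W/i) of a shift by i, for i and W powers of 2."""
    return 2 * ((W // i).bit_length() - 1)
```

For instance, shifting "abcdefgh" by 1 in windows of size 4 gives "dabchefg", and a negative distance shifts in the opposite direction.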


2.3.6. Two Dimensional shift 


SHIFT2D(A, i, j, W, L) is used in conjunction with a two
dimensional interpretation of a hypercube (or a grid mapping);
A(a, b) is shifted to A((a-i) mod W, (b-j) mod L) in each W x L
window. This can be done by first using SHIFT along one dimension
and then along the other.


In a SHIFT2P, a W x W window is assumed and the amount
of shift in each window can be different. A SHIFT2P takes at most
4 log W unit routes on an SIMD hypercube.


2.3.7. Data Accumulation 


For this operation, PE j has an array A[0..M-1] of size M.
The notation A[i](j) refers to the element A[i] in PE j. In addition,
each PE has a value in its I register. After the data accumulation,
the M elements of A in each PE j are such that:

    A[i](j) = I((j + i) mod P),  0 <= i < M,  0 <= j < P


Data accumulation may be done efficiently by adapting the
data circulation algorithm of Figure 2. Procedure ACCUM(A, I, M)
can be completed in 2(M-1) + log(N/M) unit routes [RANK87b].


2.3.8. Adjacent Sum 


This operation is defined in [PRAS87]. For each PE, p,
0 <= p < P, the sum

    \sum_{i=0}^{M-1} A[i]((p + i) mod M)

is to be computed. Data accumulation may be done efficiently by
adapting the data circulation algorithm of Figure 2. The number
of unit routes required to complete AdjacentSum(A, M) is
4M - 4 + 2 log(P/M) [RANK87b].


3. ONE DIMENSIONAL CONVOLUTION 


The inputs to the one dimensional convolution problem are
vectors I[0..N-1] and T[0..M-1]. The output is the vector C1D
where:

    C1D[i] = \sum_{v=0}^{M-1} I[(i + v) mod N] * T[v],  0 <= i < N

We use the computation of C1D as a basic step in our algorithms
to compute C2D. In this section, we develop algorithms for
C1D. We consider three cases:


(i) Each PE has O(M) memory
(ii) Each PE has O(log M) memory
(iii) Each PE has O(1) memory


Our algorithms assume that there are P = N processors and
that the vector I is mapped onto the hypercube using the identity
mapping (i.e., I(i) on PE i). Further, we assume that there are
(N/M) copies of T in the hypercube with one copy in each block of
M processors. Within a block, the mapping of T is the same as
that of I.


3.1. O(M) Memory 
When each processor has O(M) memory, the most effective
way to compute C1D is to first perform a data accumulation on I.
Following this, each processor has all the I values needed to compute
the corresponding entry of C1D. Next, the T values are circulated
through each block of M processors. During this circulation,


procedure C1D_M(M)
{ O(M) memory one dimensional convolution }
begin
    ACCUM(A, I, M);
    C1D := 0;
    in := p mod M;  { in = index of T in processor p }
    for j := 1 to M do
    begin
        C1D := C1D + A[in] * T;
        T(p^(f(m,j))) <- T(p);
        in := in ⊕ 2^{f(m,j)};
    end
end; { of C1D_M }


Figure 3: O(M) memory computation of C1D 
the T values are multiplied by I values and the C1D values computed.
Procedure C1D_M (Figure 3) provides the details. The data
accumulation takes 2(M-1) + log(N/M) unit routes and the for
loop requires another 2M. The total number of unit routes is therefore
4M + log(N/M) - 2. Note that while the final shift on T is
not necessary for the computation of C1D, our algorithms for C2D
assume that T is unchanged by the C1D algorithms. This final shift
restores the original T values.
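
The O(M) memory scheme can be checked with a small simulation. The Python sketch below is a reading of the procedure under stated assumptions, not the authors' code: the names c1d_direct, c1d_m and exchange_sequence are hypothetical, registers are modelled as lists indexed by PE number, M = 2^m with m >= 1 and N a multiple of M, and the T values are circulated with the exchange sequence of Section 2.3.1 followed by one final exchange along bit m-1.

```python
def exchange_sequence(q):
    """X_1 = [0], X_q = X_{q-1}, q-1, X_{q-1}."""
    if q == 1:
        return [0]
    prev = exchange_sequence(q - 1)
    return prev + [q - 1] + prev


def c1d_direct(I, T):
    """C1D[i] = sum over v of I[(i + v) mod N] * T[v], straight from the definition."""
    N, M = len(I), len(T)
    return [sum(I[(i + v) % N] * T[v] for v in range(M)) for i in range(N)]


def c1d_m(I, T):
    """Simulate the O(M) memory scheme; returns (C1D, final T registers)."""
    N, M = len(I), len(T)
    m = M.bit_length() - 1
    # Result of the data accumulation: A[i](p) = I((p + i) mod N).
    A = [[I[(p + i) % N] for i in range(M)] for p in range(N)]
    t = [T[p % M] for p in range(N)]       # T register of each PE
    idx = [p % M for p in range(N)]        # index of the T value each PE holds
    c1d = [0] * N
    f = exchange_sequence(m) + [m - 1]     # f(m, 1..M), with f(m, 2^m) = m - 1
    for j in range(M):
        for p in range(N):
            c1d[p] += A[p][idx[p]] * t[p]
        bit = 1 << f[j]
        t = [t[p ^ bit] for p in range(N)]     # exchange T along bit f(m, j)
        idx = [k ^ bit for k in idx]
    return c1d, t
```

The second return value lets one confirm that the final exchange leaves every PE with its original T value.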


3.2. O(log M) Memory 


Since O(log M) memory is not sufficient to perform a data 
accumulation, we need to devise another strategy to compute C1D. 
Following the strategy in procedure AdjacentSum, each PE com- 
putes two sums A and B. A is the sum of all terms in the C1D for 
that processor for which the I values are in the M block containing 
the processor. B is the sum of all terms in the C1D for the 
corresponding processor in the previous M block for which the I 
values are in this M block. Figure 4 shows the components of the A 
and B sums for each processor in an M block of processors (in the 
figure, M=8). The processor and I value indexing is relative to the 
block. The absolute index is obtained by adding Mk, where k is 
the block index, to the relative index. Values above and including 
the off diagonal correspond to A while those below correspond to 
B. 


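P0   I0T0 + I1T1 + I2T2 + I3T3 + I4T4 + I5T5 + I6T6 + I7T7
P1   I1T0 + I2T1 + I3T2 + I4T3 + I5T4 + I6T5 + I7T6 + I0T7
P2   I2T0 + I3T1 + I4T2 + I5T3 + I6T4 + I7T5 + I0T6 + I1T7
P3   I3T0 + I4T1 + I5T2 + I6T3 + I7T4 + I0T5 + I1T6 + I2T7
P4   I4T0 + I5T1 + I6T2 + I7T3 + I0T4 + I1T5 + I2T6 + I3T7
P5   I5T0 + I6T1 + I7T2 + I0T3 + I1T4 + I2T5 + I3T6 + I4T7
P6   I6T0 + I7T1 + I0T2 + I1T3 + I2T4 + I3T5 + I4T6 + I5T7
P7   I7T0 + I0T1 + I1T2 + I2T3 + I3T4 + I4T5 + I5T6 + I6T7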


Sums above and including the off diagonal are A 
Sums below the off diagonal are B 
Figure 4: A and B values to be completed by each PE 


A and B can be computed recursively by decomposing a prob- 
lem of size M into four problems of size (M/2) each as shown in 
Figure 5. Problems (a) and (c) can be solved in parallel and so also 
can problems (b) and (d). The algorithm is given in Figure 6. 


The number of unit routes required by RecursiveC1D is
given by the recurrence:

    routes(M) = 2 routes(M/2) + 8,  M > 1
    routes(M) = 0,                  M = 1

whose solution is routes(M) = 8M - 8.
Adding to this the number of unit routes required by
SHIFT(B, -M, P), we get 8M + log(P/M) + O(1) as the number of
unit routes required to compute C1D using O(log M) memory.
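
The recursive decomposition and the route recurrence can both be exercised in a short simulation. The Python sketch below is an illustration under assumptions, not the authors' code: the names recursive_c1d, c1d_logm, c1d_direct and routes are hypothetical, and registers are lists indexed by PE number. In this sketch a PE in the lower half of a block adds the partial sums it receives to A, while a PE in the upper half adds them to B, which is what the definitions of A and B require.

```python
def c1d_direct(I, T):
    """C1D[i] = sum over v of I[(i + v) mod N] * T[v]."""
    N, M = len(I), len(T)
    return [sum(I[(i + v) % N] * T[v] for v in range(M)) for i in range(N)]


def recursive_c1d(I, t, M):
    """Per-PE sums (A, B) for blocks of size M; t[p] is the T register of PE p."""
    N = len(I)
    if M == 1:
        return [I[p] * t[p] for p in range(N)], [0] * N
    h = M // 2                                   # mask of bit b = log2(M) - 1
    A1, B1 = recursive_c1d(I, t, h)              # problems (a) and (c)
    tx = [t[p ^ h] for p in range(N)]            # halves exchange T values
    A2, B2 = recursive_c1d(I, tx, h)             # problems (b) and (d)
    X = [B2[p] if p & h else B1[p] for p in range(N)]
    Y = [A1[p] if p & h else A2[p] for p in range(N)]
    X = [X[p ^ h] for p in range(N)]             # move partial sums along bit b
    Y = [Y[p ^ h] for p in range(N)]
    A = [A2[p] if p & h else A1[p] + X[p] + Y[p] for p in range(N)]
    B = [B1[p] + X[p] + Y[p] if p & h else B2[p] for p in range(N)]
    return A, B


def c1d_logm(I, T):
    """Combine A with the B computed M PEs further on (the final shift of B)."""
    N, M = len(I), len(T)
    A, B = recursive_c1d(I, [T[p % M] for p in range(N)], M)
    return [A[p] + B[(p + M) % N] for p in range(N)]


def routes(M):
    """routes(M) = 2 routes(M/2) + 8 for M > 1, routes(1) = 0."""
    return 0 if M == 1 else 2 * routes(M // 2) + 8
```

Evaluating routes on powers of 2 reproduces the closed form 8M - 8.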


Note that M invocations of the above algorithm will require
8M^2 + O(M log N) unit routes. In case the image values remain the
same for all invocations, the O(M log N) term can be reduced to
O(log N). This is achieved by making each block of size M calculate
all the result values itself. A SHIFT(A, -M, P) is performed
before invoking C1D_logM. Now each block has all the I values it
requires for its convolution values. For every invocation of one
dimensional convolution, two C1D_logM procedures (without lines 4
and 5) are invoked: the first with the original I values and the
second with the new I values received (i.e., the ones from the next
block). By adding the A values of the first call to the B values of
the second, we get the desired convolution. Thus M invocations will
require 16M^2 + O(log P) + O(M) unit routes. Further optimization
is possible. Notice that if all the I_i terms below the off diagonal of
the matrix are replaced by I_{(i+M) mod P}, then the sum B will
represent the values corresponding to its own block. Moreover the I
values are not moved in the algorithm. Thus by passing M values
along with their index values and by modifying line 10


P0   I0T0 + I1T1 + I2T2 + I3T3
P1   I1T0 + I2T1 + I3T2 + I0T3
P2   I2T0 + I3T1 + I0T2 + I1T3
P3   I3T0 + I0T1 + I1T2 + I2T3
(a)

P0   I0T4 + I1T5 + I2T6 + I3T7
P1   I1T4 + I2T5 + I3T6 + I0T7
P2   I2T4 + I3T5 + I0T6 + I1T7
P3   I3T4 + I0T5 + I1T6 + I2T7
(b)

P4   I4T4 + I5T5 + I6T6 + I7T7
P5   I5T4 + I6T5 + I7T6 + I4T7
P6   I6T4 + I7T5 + I4T6 + I5T7
P7   I7T4 + I4T5 + I5T6 + I6T7
(c)

P4   I4T0 + I5T1 + I6T2 + I7T3
P5   I5T0 + I6T1 + I7T2 + I4T3
P6   I6T0 + I7T1 + I4T2 + I5T3
P7   I7T0 + I4T1 + I5T2 + I6T3
(d)


Figure 5: 4 decomposed problems with M = 4


appropriately we can make sure that each term is calculated by
using either I_i or I_{(i+M) mod P}. It can be easily shown that
12M^2 + O(log N) + O(M) unit routes are required.


3.3. O(1) Memory 


First, we develop two data rotation patterns that are needed
by our O(1) memory algorithm. The first pattern obtains all circular
shifts of even length in the interval [1, M-1]. There are
exactly (M/2)-1 such shifts (recall that M = 2^m is a power of 2).
A shift distance sequence, E_m, is a sequence d_1 d_2 ... d_{(M/2)-1} of
positive integers such that a clockwise shift of d_1, followed by one
of d_2, followed by one of d_3, etc. covers all even length shifts.


Note that E_0 = E_1 = null as there are no even length shifts in
the range [1, 2^m - 1] when m = 0 and 1. E_2 = 2. This transforms the
length M = 2^2 sequence abcd into the sequence cdab. In general,
the choice E_m = 2, 2, 2, ... will serve to obtain all even length
shifts. From the complexity standpoint this choice is poor as each
shift requires 2 log(M/2) unit routes. Better performance is
obtained by defining

    E_0 = E_1 = null,  E_2 = 2
    E_k = InterLeave(E_{k-1}, 2^{k-1}),  k > 2

where InterLeave is an operation that inserts a 2^{k-1} in front of
E_{k-1}, at the end of E_{k-1}, and between every pair of adjacent
distances in E_{k-1}. Thus,

    E_3 = InterLeave(E_2, 4)
        = 4 2 4
    E_4 = InterLeave(E_3, 8)
        = 8 4 8 2 8 4 8


When a shift sequence E_k is used, the effective shift following
d_i is (\sum_{j=1}^{i} d_j) mod 2^k. Thus when E_3 is used on the sequence


line  procedure C1D_logM(M)
1     { O(log M) memory algorithm for C1D }
2     begin
3         RecursiveC1D(M, A, B);
4         SHIFT(B, -M, P);
5         C1D := A + B;
6     end; { of C1D_logM }

7     procedure RecursiveC1D(M, A, B)
8     { compute A and B in M blocks }
9     begin
10        if M = 1 then [ A := I*T; B := 0; return ]
11        RecursiveC1D(M/2, A1, B1);   { problems (a) and (c) }
12        b := log_2 M - 1;
13        T(p) <- T(p^(b));   { PEs p and p^(b) exchange T values }
14        RecursiveC1D(M/2, A2, B2);   { problems (b) and (d) }
15        T(p) <- T(p^(b));   { restore T values }
16        X := B2; (p_b = 1)
17        X := B1; (p_b = 0)
18        Y := A1; (p_b = 1)
19        Y := A2; (p_b = 0)
20        X(p) <- X(p^(b));   { move partial sums to correct PEs }
21        Y(p) <- Y(p^(b));   { move partial sums to correct PEs }
22        A := A1 + X + Y; (p_b = 0)
23        A := A2; (p_b = 1)
24        B := B2; (p_b = 0)
25        B := B1 + X + Y; (p_b = 1)
26    end; { of RecursiveC1D }


Figure 6: O(log M) memory computation of C1D 
abcdefgh, we get

    d    sequence    effective shift
    4    efghabcd    4
    2    ghabcdef    6
    4    cdefghab    2 = 10 mod 8

Theorem 1: Let E[k, i] be d_i in the sequence E_k, k >= 2. Let
ESUM[k, i] = (\sum_{j=1}^{i} E[k, j]) mod 2^k. Then
{ ESUM[k, i] | 1 <= i <= 2^{k-1} - 1 } = { 2, 4, 6, 8, ..., 2^k - 2 }.

Proof: See [RANK87b].

Theorem 2: The shift sequence E_k can be done in
2(2^k - k - 1) unit routes, k >= 2.

Proof: See [RANK87b].


The result of the preceding theorem is important as it says
that the average cost of a rotation in E_k is
2(2^k - k - 1)/(2^{k-1} - 1) < 4. So, we can perform even length
rotations with O(1) average cost. This is crucial to our algorithm.


Let F_k be the sequence obtained by dividing each distance in
E_k by 2. So, F_0 = F_1 = null, F_2 = 1, F_3 = 2, 1, 2, etc.


Theorem 3: Let FSUM[k, i] = (\sum_{j=1}^{i} F[k, j]) mod 2^{k-1}, where
F[k, j] is the j'th distance in F_k.
(a) { FSUM[k, i] | 1 <= i <= 2^{k-1} - 1 } = { 1, 2, 3, ..., 2^{k-1} - 1 }
(b) All the shifts in F_k can be done in a window of size 2^{k-1} in
    2(2^k - k - 1) unit routes.



Proof: Similar to the proof of Theorems 1 and 2. 
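
The shift distance sequences are easy to generate and check numerically. The Python sketch below is illustrative only: the function names are hypothetical, sequences are plain lists, and the route count assumes the 2 log(W/i) cost quoted in Section 2.3.5 for a shift by a power of 2 distance i in a window of size W.

```python
def interleave(seq, d):
    """Insert d in front of seq, at its end, and between adjacent entries."""
    out = [d]
    for x in seq:
        out += [x, d]
    return out


def E(k):
    """E_0 = E_1 = null, E_2 = 2, E_k = InterLeave(E_{k-1}, 2^(k-1))."""
    if k < 2:
        return []
    if k == 2:
        return [2]
    return interleave(E(k - 1), 1 << (k - 1))


def F(k):
    """F_k: every distance of E_k divided by 2."""
    return [d // 2 for d in E(k)]


def effective_shifts(seq, window):
    """Running sums of the distances, reduced modulo the window size."""
    total, out = 0, []
    for d in seq:
        total = (total + d) % window
        out.append(total)
    return out


def route_cost(k):
    """Total cost of E_k in a window of 2^k at 2*log2(W/d) routes per shift."""
    return sum(2 * (k - d.bit_length() + 1) for d in E(k))
```

For k = 3 this gives E(3) = [4, 2, 4] with effective shifts [4, 6, 2], matching the abcdefgh example.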


As in our earlier algorithms each PE will compute two quantities
A and B. For any PE, A is the sum of all the C1D terms
that are in the M block containing the PE. B is the sum of all C1D
terms that are needed by the corresponding PE in the previous M
block. The terms contributing to A and B are shown in Figure 4.
The A and B values are computed in two stages. In the first, we
compute the contribution to A and B by all I terms I_j for j even. In
the next stage, we do this for the case j odd.


Consider the case M = 8. If we begin by computing the terms
on the major diagonal of Figure 4, then PEs (0, 1, 2, ..., 7) compute
(I0T0, I2T1, I4T2, I6T3, I0T4, I2T5, I4T6, I6T7). The I and T values
required by each of the 8 PEs are shown in the first two rows of
Figure 7. Notice that if we rotate the I values in windows of size 4
by some amount j, then the T values need to be rotated by 2j so
that each PE has a pair (I, T) whose product is needed in the
computation of its A or B value. For this rotation we use the
sequences F_3 and E_3. Rotating I by F[3, 0] in size 4 windows and T
by E[3, 0] in a size 8 window gives the next two rows of Figure 7.
The result of performing the remaining rotations is also given in
Figure 7. Figure 8 gives the computation of the odd terms.


Figure 7: Computing the even terms


Figure 8: Computing the odd terms


The initial configuration for the I's can be obtained by concentrating
the even I's using the strategy described in Figure 9 for
the case of M = 16. This requires log M unit routes. Let
CONCENTRATE(I, M) be the algorithm that does this. The algorithm
for one dimensional convolution now takes the form given in
Figure 10. Note that the E's and F's are known only to the control
unit. These may be computed, on the fly, in linear time using a
stack of height m = log M. The memory required in each hypercube
PE is only O(1). Lines 5 through 15 handle the even terms.
Notice that (CShift + 2p) mod M gives the index of the I value
currently in C(p). So, if this index is less than p the term C*D
corresponds to the previous block. Otherwise the term C*D is for
this PE. The fact that each PE always has a C and a D whose
product contributes to either A or B follows from the observations
that this is so initially and on each iteration, D rotates twice as
much as C. The total number of unit routes is
8M + O(log P) + O(log M).


Let us consider M invocations of one dimensional convolution
with the same I values. By an argument similar to the one
presented in the previous sub-section, M invocations can be completed
in 16M^2 + O(log N) + O(M log M) unit routes. We perform a
SHIFT(I, -M, P), followed by lines 7, 18 and 19 on the old and new
I values, and store these results for the later invocations. Now by
defining C = {I_old, I_new} and modifying steps 11 and 28 so that they
calculate terms for this block, we can show that M invocations can
be completed in 12M^2 + O(log N) + O(M log M) unit routes.


4. TWO DIMENSIONAL CONVOLUTION 


Assume that P = N^2 PEs are available. These may be viewed
as an N x N array as described in Section 2. We assume that
I(i, j) is initially in the I register of PE(i, j). Further, since N and
M are assumed to be powers of 2, the N x N array may further
be viewed as composed of (N^2/M^2) arrays of size M x M. We
assume that T is initially in the top left such array.

4.1. O(M) Memory 

When O(M) memory is available, PE(i, j), 1 <= i <= N, 1 <= j <= N,
computes M one dimensional convolutions S[q], 0 <= q < M, defined
as below:

    S[q](i, j) = \sum_{r=0}^{M-1} I[i, (j + r) mod N] * T[q, r]

Next, C2D is obtained by performing an adjacent sum operation
along the columns of the N x N PE array. A high level
description of the algorithm is given in Figure 11.


The total number of unit routes is 2M^2 + O(M + log M). The
number of unit routes for Steps 1-4 are log(N^2/M^2),
2M + log(N/M), M(2M + log(N/M)) and 4M - 4 + 2 log(N/M)
respectively.

Figure 9: Initial configuration for even terms


line  procedure C1D_1(M)
1     { O(1) memory C1D algorithm }
2     begin
3         A := 0; B := 0; m := log M;
4         { even terms }
5         C := I; D := T;
6         CShift := 0;
7         CONCENTRATE(C, M);
8         for j := 1 to M/2 do
9         begin
10            A := A + C * D; ((CShift + 2p) mod M >= p)
11            B := B + C * D; ((CShift + 2p) mod M < p)
12            SHIFT(C, F[m, j-1], M/2);
13            CShift := (CShift + F[m, j-1]) mod (M/2);
14            SHIFT(D, E[m, j-1], M);
15        end
16        { odd terms }
17        C := I; D := T;
18        SHIFT(C, -1, P); CShift := 1; SHIFT(D, -1, M);
19        CONCENTRATE(C, M);
20        for j := 1 to M/2 do
21        begin
22            A := A + C * D; ((CShift + 2p) mod M >= p)
23            B := B + C * D; ((CShift + 2p) mod M < p)
24            SHIFT(C, F[m, j-1], M/2);
25            CShift := (CShift + F[m, j-1]) mod (M/2);
26            SHIFT(D, E[m, j-1], M);
27        end
28        SHIFT(B, -M, P);
29        C1D := A + B;
30    end; { of C1D_1 }


Figure 10: O(1) memory SIMD C1D algorithm 
procedure C2D_M(N, M)
{ assumes O(M) memory per PE }
Step1: Broadcast T to all M x M blocks in the N x N PE array.
Step2: Perform a data accumulation on I. Now each PE contains
       the M I values it needs to compute its S(q)'s.
Step3: Compute the S(q)'s. Each S(q) is a one dimensional
       convolution. However, the data accumulation step of
       the algorithm of Figure 3 may be omitted as the I
       values have already been accumulated in Step 2. To
       go from one S to another, the T values need to be
       circulated along the columns of each M x M block. This
       can be done using the data circulation algorithm of
       Section 2.
Step4: Compute C2D[i, j] = \sum_{r=0}^{M-1} S[r]((i + r) mod N, j). This
       is done using the adjacent sum algorithm of Section 2
       on the columns of the N x N PE array.
end


Figure 11: High level description of two dimensional convolution 
with each PE having O(M) memory 
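
The decomposition behind Steps 3 and 4 can be checked sequentially. The Python sketch below is an illustration, not the parallel algorithm: c2d_direct evaluates the double sum that defines C2D, while c2d_by_rows first forms the row convolutions S[q] and then adds them along columns as in Step 4. Both function names are hypothetical, and images and templates are nested lists.

```python
def c2d_direct(I, T):
    """C2D[i][j] = sum over u, v of I[(i+u) mod N][(j+v) mod N] * T[u][v]."""
    N, M = len(I), len(T)
    return [[sum(I[(i + u) % N][(j + v) % N] * T[u][v]
                 for u in range(M) for v in range(M))
             for j in range(N)] for i in range(N)]


def c2d_by_rows(I, T):
    """Row convolutions S[q] followed by a sum along columns."""
    N, M = len(I), len(T)
    # S[q][i][j]: one dimensional convolution of row i of I with row q of T.
    S = [[[sum(I[i][(j + r) % N] * T[q][r] for r in range(M))
           for j in range(N)] for i in range(N)] for q in range(M)]
    # C2D[i][j] = sum over r of S[r]((i + r) mod N, j).
    return [[sum(S[r][(i + r) % N][j] for r in range(M))
             for j in range(N)] for i in range(N)]
```

The two functions agree on any N x N image and M x M template with M <= N.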


Now, it is not possible for each PE to accumulate the M
values of I it needs from its row. Nor is it possible for a PE to
compute the values S[q], 0 <= q < M. The new strategy is similar to that
used in computing C1D when only O(1) memory is available. We
may rewrite the definition of C2D as

    C2D[i, j] = \sum_{r=0}^{M-1} CXD[i, r, j]

where

    CXD[i, r, j] = \sum_{a=0}^{M-1} I[(i + r) mod N, (j + a) mod N] * T[r, a]


Since each CXD is a one dimensional convolution, it can be
computed using algorithm C1D_1. PE(i, j) computes

    E = \sum_{r=0}^{M-1-(i mod M)} CXD[i, r, j]  and
    F = \sum_{r=M-(i mod M)}^{M-1} CXD[(i-M) mod N, r, j].

Thus each PE computes a value for itself (i.e., E) and a value for the
corresponding PE in the adjacent upper M x M block (i.e., F). The F values are
then shifted M units along the columns and added to the E values
to get the C2D values. A high level description of the algorithm is
provided in Figure 12. In iteration k of Step 3, the PEs in column j
of an M x M PE block compute the CXD terms needed for the E
and F of PE(floor(i/M)*M + k, j). Then in Step 4, these terms are
added together to get the E and F for this PE. The time complexity
of Steps 1, 3, 4 and 5 is log(N^2/M^2), 12M + O(log M), 3 log M
and 2 log(N/M) respectively. The total number of unit routes
taken is 12M^2 + O(M log M) + O(log N). A slightly more efficient
algorithm results if we interpret C2D as:

    C2D[i, j] = \sum_{r=0}^{M-1} X[i, j, r] * Y[r]

where X[i, j, r] is the 1 x M vector
I[(i+r) mod N, j .. (j+M-1) mod N] and Y[r] is the 1 x M vector
T[r, 0 .. M-1]. Thus C2D is viewed as the one dimensional convolution
of X and Y where X and Y are vectors. We can extend algorithm
C1D_1 to obtain an algorithm that requires
12M^2 + O(M) + O(log N) unit routes and computes this one dimensional
convolution. This algorithm is quite a bit more complex than
Figure 12 and is omitted.


procedure C2D_1(N, M)
{ assumes O(1) memory per PE }
Step1: Broadcast T to all M x M blocks in the N x N PE array.
Step2: Repeat Steps 3 and 4 for k := 0 to M-1.
Step3: Compute CXD[(floor(i/M)*M + k) mod N, i mod M - k, j]
       if i mod M >= k using C1D_1(M) and put the result
       in A, otherwise A = 0;
       Compute CXD[(floor(i/M)*M + k - M) mod N, i mod M - k + M, j]
       if i mod M < k using C1D_1(M) and put the result in
       B, otherwise B = 0;
Step4: Use the data sum operation, described in Section 2, to
       sum the results for the adjacent upper block and itself,
       by summing up B's and A's in PE(floor(i/M)*M + k, j) in
       F and E respectively. Shift the T values along
       columns by 1, using the algorithm of Section 2.
Step5: Shift(F, -M, N) along columns. E := E + F.


Figure 12: High level description of two dimensional convolution 
with each PE having O(1) memory 


4.2. O(log M) Memory 


The algorithm for two dimensional convolution with each PE
having O(log M) memory is the same as that of Figure 12. The
only difference is that the one dimensional convolution used in Step
2 refers to one dimensional convolution with O(log M) memory. The
number of unit routes required is 12M^2 + O(M log M) + O(log N).
5. EXTENSIONS


It is easy to see that the algorithms developed for two dimensional
convolutions in the previous section can be extended to windows
of size M_1 x M_2 (where M_1 and M_2 are powers of 2). Let us
consider the case where we have an M x M window and M is not a
power of 2. In this case, let m_k m_{k-1} ... m_0 be the binary
representation of M. The M x M T matrix may be viewed as several
2^i x 2^j matrices (m_i = m_j = 1). A two dimensional convolution is
performed for each such matrix and the results added.
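
The splitting of a template whose side is not a power of 2 can be sketched as follows. This Python illustration rests on assumptions that are not in the text: the tiles are laid out with the largest power of 2 first, the names power_of_two_tiles, c2d and c2d_by_tiles are hypothetical, and the convolutions are evaluated sequentially rather than on a hypercube. It only demonstrates that adding the convolutions of the sub-templates, each taken at its own offset, reproduces the convolution with the full template.

```python
def power_of_two_tiles(M):
    """Tiles (row0, col0, height, width) covering an M x M template, one
    2^i x 2^j tile for every pair of 1 bits in the binary form of M."""
    sizes = [1 << b for b in range(M.bit_length()) if (M >> b) & 1]
    sizes.reverse()                      # largest power of 2 first (a choice)
    spans, start = [], 0
    for s in sizes:
        spans.append((start, s))
        start += s
    return [(r0, c0, h, w) for r0, h in spans for c0, w in spans]


def c2d(I, T, r0=0, c0=0):
    """Convolution of I with a rectangular template T placed at offset (r0, c0)."""
    N = len(I)
    return [[sum(I[(i + r0 + u) % N][(j + c0 + v) % N] * T[u][v]
                 for u in range(len(T)) for v in range(len(T[0])))
             for j in range(N)] for i in range(N)]


def c2d_by_tiles(I, T):
    """Add up the convolutions of I with each power of 2 sized tile of T."""
    N, M = len(I), len(T)
    out = [[0] * N for _ in range(N)]
    for r0, c0, h, w in power_of_two_tiles(M):
        sub = [row[c0:c0 + w] for row in T[r0:r0 + h]]
        part = c2d(I, sub, r0, c0)
        for i in range(N):
            for j in range(N):
                out[i][j] += part[i][j]
    return out
```

For M = 6 the tiles are 4 x 4, 4 x 2, 2 x 4 and 2 x 2, so four convolutions are added.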


When the number, P, of processors is N^2 K^2, 1 <= K <= M, the
techniques of [PRAS87] may be used to extend our algorithms to
obtain optimal N^2 K^2 PE algorithms.


6. CONCLUSION 


In this paper, we have presented optimal algorithms for 1-D
convolution and Image Template Matching (2-D Convolution).
These algorithms use novel strategies to achieve optimal speed-up
using O(M), O(log M) and O(1) memory per PE for an M x M
template. Our algorithm for O(1) memory is asymptotically
faster than previously known algorithms. Unlike previous algorithms
for this problem, our algorithms do not use data broadcasting.
IMAGE TEMPLATE MATCHING ON MIMD HYPERCUBE MULTICOMPUTERS *


Sanjay Ranka and Sartaj Sahni 


University of Minnesota 


Abstract 


Efficient algorithms for image template matching on fine 
grained as well as medium grained MIMD hypercube multicomput- 
ers are developed. Template matching algorithms for MIMD hyper- 
cube multicomputers have not, to our knowledge, been previously 
developed. The medium grained MIMD algorithm is developed 
specifically for the NCUBE multicomputer. This algorithm is com- 
pared experimentally with an algorithm that is optimized for the 
CRAY2 supercomputer. In addition, customized algorithms are 
developed for Kirsch templates. 


1. INTRODUCTION 


The inputs to the image template matching problem are an
N x N image matrix I[0..N-1, 0..N-1] and an M x M template
T[0..M-1, 0..M-1]. The output is an N x N matrix C2D where

    C2D[i, j] = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} I[(i+u) mod N, (j+v) mod N] * T[u, v],
    0 <= i, j < N

C2D is called the two dimensional convolution of I and T. Template
matching, i.e., computing C2D, is a fundamental operation
in computer vision and image processing. It is often used for edge
and object detection; filtering; and image registration [ROSE82,
BALL85]. Because of the fundamental nature of this problem and
because of its high complexity (O(M^2 N^2) on a single processor
computer), much attention has been devoted to the development of
efficient fine grain multicomputer parallel algorithms. For example,
Chang, Ibarra, Pong and Sohn [CHAN87] have studied this
problem on an SIMD pyramid computer; Ranka and Sahni
[RANK87a], Maresca and Li [MARE86], and Lee and Agarwal
[LEE87] have considered mesh connected computers; and Fang, Li
and Ni [FANG85], Fang and Ni [FANG86], Prasanna Kumar and
Krishnan [PRAS86] and Ranka and Sahni [RANK87b] have considered
SIMD hypercube multicomputers.


In this paper, we restrict our attention to MIMD hypercube
multicomputers. We assume that M and N are powers of 2 and
develop two asymptotically optimal algorithms that require N^2
processors. These require O(M) and O(1) memory per processor,
respectively. The algorithms developed in [PRAS87], [LEE87],
[FANG85], [FANG86], and [MARE86] require a broadcast capability.
Our algorithms do not require this. Using the techniques of
[PRAS87], both our algorithms may be generalized to obtain
asymptotically optimal algorithms for MIMD hypercube computers
with N^2 K^2, 1 <= K <= M processors and O(M/K) and O(1) memory
per processor respectively. Our medium grain MIMD algorithm is
developed for the NCUBE hypercube. This algorithm is evaluated
experimentally and a comparison with a single processor CRAY2 is
made.


Section 2 describes our hypercube model. In addition, nota- 
tion and some fundamental data movement operations are 
developed in this section. In Section 3, we develop fine grained 
algorithms for one dimensional convolution. These form a basic 
component of our two dimensional convolution algorithms which 
are developed in Section 4. Section 5 considers Kirsch templates 
and in Section 6, the medium grain MIMD algorithm for the 
NCUBE together with one for the CRAY2 supercomputer are 
developed. Experimental results are also presented in this section. 


* This research was supported in part by the National Science Foundation 
under grants DCR84-20935 and MIP 86-17374 


2. PRELIMINARIES 


2.1. Hypercube Multicomputer 


The important features of an MIMD hypercube and the pro- 
gramming notation we use are: 


1. There are P = 2^p processing elements connected together via
a hypercube interconnection network (to be described later).
Each PE has a unique index in the range [0, 2^p - 1]. The
local memory of each PE holds both the data and the program
that the PE is to execute. Throughout this paper, we
shall use brackets ('[ ]') to index an array and
parentheses ('( )') to index the PEs. Thus A[i] refers to the i'th
element of the array A while A(i) refers to the A register of PE
i. Likewise A[i](j) refers to the i'th element of array A of
PE j.


2. A p dimensional hypercube network connects 2^p PEs. Let
i_{p-1} i_{p-2} ... i_0 be the binary representation of the PE index i.
Let \bar{i}_k be the complement of bit i_k. A hypercube network
directly connects pairs of processors whose indices differ in
exactly one bit. I.e., processor i_{p-1} i_{p-2} ... i_0 is connected to
processors i_{p-1} ... \bar{i}_k ... i_0, 0 <= k <= p-1. We use the
notation i^(b) to represent the number that differs from i in exactly
bit b.


3. At any given instance, different PEs may execute different
instructions. In particular, PE i may transfer data to PE i^(a),
while PE j simultaneously transfers data to PE j^(b), a != b.


4. An instruction mask is a boolean function used to describe
PEs which will remain active during an instruction. For
example, in the instruction

    A(i) := A(i) + 1, (i_0 = 1)

(i_0 = 1) is a mask, which states that only PEs with index bit
0 equal to 1 remain active during the instruction. I.e., odd
indexed PEs increment their A register value by 1. We shall
often omit the PE index from our instructions. Thus, the
above statement can also be written as

    A := A + 1, (i_0 = 1)


5. Interprocessor assignments are denoted using the symbol <-,
while intraprocessor assignments are denoted using the symbol
:=. Thus the assignment statement:

    B(i^(2)) <- B(i), (i_2 = 0)

is executed only by the processors with bit 2 equal to 0.
These processors transmit their B register data to the
corresponding processors with bit 2 equal to 1.


6. In a unit route, data may be transmitted from one processor
to another to which it is directly connected. We assume that
the links in the interconnection network are unidirectional.
Hence at any given time, data can be transferred either from
PE i (i_b = 0) to PE i^(b) or from PE i (i_b = 1) to PE i^(b).
Hence the instruction

    B(i^(2)) <- B(i), (i_2 = 0)

takes one unit route, while the instruction:

    B(i^(2)) <- B(i)

takes two unit routes.

7. Since the asymptotic complexity of all our algorithms is 

determined by the number of unit routes, our complexity 
analysis will count only these. 
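
The conventions in the list above can be mirrored in a few lines of Python. This is a minimal model for illustration only: registers are lists indexed by PE number, and the helper names flip, neighbours, masked_increment and route are hypothetical. The route helper returns the unit route count implied by the unidirectional link assumption, namely one route when a mask restricts the senders to one side of the chosen bit and two routes for a full exchange.

```python
def flip(i, b):
    """i^(b): the index that differs from i in exactly bit b."""
    return i ^ (1 << b)


def neighbours(i, p):
    """Indices directly connected to PE i in a p dimensional hypercube."""
    return [flip(i, b) for b in range(p)]


def masked_increment(A, b):
    """A := A + 1, (i_b = 1): only PEs whose bit b is 1 stay active."""
    return [a + 1 if (i >> b) & 1 else a for i, a in enumerate(A)]


def route(B, b, sender_bit=None):
    """B(i^(b)) <- B(i), optionally under the mask (i_b = sender_bit).

    Returns (new register contents, unit routes used).
    """
    out = list(B)
    for i in range(len(B)):
        if sender_bit is None or ((i >> b) & 1) == sender_bit:
            out[flip(i, b)] = B[i]
    return out, (2 if sender_bit is None else 1)
```

With eight PEs, a masked route along bit 2 copies the registers of PEs 0 to 3 into PEs 4 to 7, while the unmasked form swaps the two halves.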
2.2. Hypercube Embedding of a Grid


Figure 1 shows an embedding of a 4 x 4 image grid into a
hypercube of dimension 4. The number inside a box is the binary
representation of the index of the PE to which that element is
mapped. The embedding of Figure 1 uses the binary reflected gray
code mapping of [CHAN86]. An i bit binary gray code S_i is defined
recursively as below:

    S_1 = 0, 1;  S_k = 0[S_{k-1}], 1[S_{k-1}]^R

where [S_{k-1}]^R is the reverse of the k-1 bit code S_{k-1} and b[S] is
obtained from S by prefixing b to each entry of S. So,
S_2 = 00, 01, 11, 10 and S_3 = 000, 001, 011, 010, 110, 111, 101, 100.


Figure 1: Embedding of a 4 x 4 mesh in a hypercube of dimension 4


If N = 2^n, then S_{2n} is used to map an N x N grid into a
P = N^2 hypercube. The elements of S_{2n} are assigned to the elements
of the N x N grid in a snake like row major order
[THOM77]. This mapping has the property that grid elements that
are neighbors are assigned to neighboring hypercube nodes.
Another interesting property is evident from the definition of S_k
and the linear drawing of Figure 2. In this figure, PEs appear in
the order given by S_3. A hypercube has circular lists of length 2^i
for all i embedded in it. Furthermore, these circular lists of length
2^i are present in every row and column of the grid embedding.
Also, the PEs in each circular list of length 2^i form a 2^i PE
subhypercube, i >= 1.


Figure 2: Rings of size 2, 4 and 8 in an 8 PE hypercube 


For a hypercube with P = 2^p PEs, we define the function
gray(i) such that gray(0) = 0 and gray(i) is the index of the PE
that immediately follows the PE gray(i-1) in the circular list of
size 2^p obtained from the above gray code embedding. For the
example of Figure 2, P = 2^3 = 8, gray(0..7) = (0, 1, 3, 2, 6, 7, 5, 4).
The function igray is the inverse of gray. So,
igray(0..7) = (0, 1, 3, 2, 7, 6, 4, 5).
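
The recursive definition of S_k and the functions gray and igray translate directly into code. In the Python sketch below the function names gray_code and igray_table are hypothetical, and code words are represented as integers rather than bit strings.

```python
def gray_code(k):
    """S_1 = 0, 1; S_k = 0[S_{k-1}], 1[S_{k-1}]^R, with entries as integers."""
    if k == 1:
        return [0, 1]
    prev = gray_code(k - 1)
    return prev + [(1 << (k - 1)) | x for x in reversed(prev)]


def igray_table(k):
    """Inverse of gray: igray[pe] is the position of PE pe in the ring."""
    g = gray_code(k)
    inv = [0] * len(g)
    for rank, pe in enumerate(g):
        inv[pe] = rank
    return inv
```

Here gray_code(3) plays the role of gray(0..7) and igray_table(3) that of igray(0..7) in the example above.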
2.3. Basic Data Manipulation Operations


2.3.1. Data Circulation 


Consider a P = 2^p processor hypercube. We are required to
circulate the data in the A register of these PEs so that this data 
visits each of the P processors exactly once. This operation is easy 
to accomplish on an MIMD hypercube using the binary gray code 
mapping and the observation that we have a P processor circular 
list (Figure 2). We simply shift the contents of the A registers cir- 
cularly clockwise by 1 each time (Figure 3). Procedure CIRCU- 
LATE circulates the A register data through the P processors in 
O(P) time. This is trivially optimal. 


procedure CIRCULATE(A);
{data circulation}
for i := 1 to P-1 do
    A(gray(j)) <- A(gray((j + 1) mod P));
end


Figure 3: Data circulation in an MIMD hypercube 
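
The ring based circulation can be simulated to confirm that every transfer is between directly connected PEs and that each PE sees all P values. The Python sketch below is an illustration under assumptions: the names gray_code and circulate_ring are hypothetical, and the A registers are a list indexed by physical PE number.

```python
def gray_code(k):
    """Binary reflected gray code S_k as a list of integers."""
    if k == 1:
        return [0, 1]
    prev = gray_code(k - 1)
    return prev + [(1 << (k - 1)) | x for x in reversed(prev)]


def circulate_ring(p):
    """Shift the A registers one step around the gray code ring P-1 times.

    Returns (seen, links_ok): seen[pe] is the set of origins that visited
    PE pe, and links_ok tells whether every transfer used a hypercube link.
    """
    P = 1 << p
    g = gray_code(p)
    a = list(range(P))                    # a[pe] = origin of the value in PE pe
    seen = [{pe} for pe in range(P)]
    links_ok = True
    for _ in range(P - 1):
        nxt = a[:]
        for j in range(P):
            src, dst = g[(j + 1) % P], g[j]
            links_ok = links_ok and bin(src ^ dst).count("1") == 1
            nxt[dst] = a[src]             # A(gray(j)) <- A(gray((j+1) mod P))
        a = nxt
        for pe in range(P):
            seen[pe].add(a[pe])
    return seen, links_ok
```

Each of the P-1 iterations moves every value one position along the ring, so the simulation performs P-1 shift steps in total.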


2.3.2. Data Broadcast 


In a data broadcast, data originates at one PE and is to be
transmitted to the remaining P-1 PEs. This can be done in log P
unit routes using a tree broadcast scheme [DEKE81].


2.3.3. Window Broadcast 


Assume that W is a power of 2 and that a P processor 
hypercube is tiled by windows of this size such that each window 
forms a subhypercube with W PEs. In a window broadcast, data 
originates in one of these windows (different data in different PEs 
of the window). The data in this window is to be copied to the 
remaining P/W —1 windows. This copying can be done using 
log(P/W)+2 unit routes [DEKE81]. 


2.3.4. Data Sum 


Assume the window tiling of Section 2.3.3. The data in each
of the windows is to be summed and the sum left in a prespecified
PE (same relative PE for each window). For example, if we are
summing the A register data, we may be required to compute:

    Sum(index(iW)) = \sum_{j=0}^{W-1} A(index(iW + j)),  0 <= i < (P/W)

Here, index(q) gives the physical index of the q'th PE in the tiling
scheme. We assume that the P PEs are first ordered (for example
using an S_p) and then tiled using 1 x W tiles. Thus, iW + j is the
j'th PE in the i'th tile and iW is the 0'th PE in the i'th tile. Data
sum can be done in log W unit routes [DEKE81].


2.3.5. Shift 


SHIFT(A, i, W) shifts the A register data circularly
counterclockwise by i in windows of size W. I.e., A(gray(qW + j)) is
replaced by A(gray(qW + (j - i) mod W)), 0 ≤ q < P/W,
0 ≤ j < W. In a gray code indexing, the indexing within each size
2^k window also corresponds to a gray code (consider the least
significant k bits). Hence each pair of adjacent size 2^k windows
differs in exactly one bit. Now suppose the shift amount i is a
power of 2. We can get data to the correct size i window by routing
along the single bit in which adjacent size i windows differ.

Following this, the data in each size i window needs to be reversed
(unless i = 1). This reversal may be accomplished by exchanging
data in the two size i/2 windows that make up a size i window.
The total number of unit routes required when i is a power of 2 is
therefore at most 2 to get the data to the correct size i window
(note that 2 routes are needed when i = W/2 and one otherwise)
plus at most 2 to reverse within the size i window. Hence at most
4 unit routes are needed to perform a shift of size i. When i is not
a power of 2, i can be written as the sum of powers of 2 and the
shift obtained by performing successive power of 2 shifts. Since
only one of these can be a W/2 shift, the number of unit routes is
at most 3 #1(i) + 1, where #1(i) is the number of ones in the
binary representation of i. The worst case performance can be
kept at 3(log W)/2 + 1 by noting that if there are more than
(log W)/2 one bits, we can do a W-1-i clockwise shift followed by
a unit clockwise shift. Also note that the special cases of i = 1, 2,
and 3 are easily done in i unit routes unless W = 1 (in this case, a
shift of 1 takes 2 unit routes).


2.3.6. Data Accumulation 


For this operation, PE j has an array A[0..M-1] of size M.
In addition, each PE has a value in its I register. After the data
accumulation, the M elements of A in each PE j are such that:

A[i](gray(j)) = I(gray((j + i) mod P)),  0 ≤ i < M, 0 ≤ j < P

This can be accomplished in M-1 unit routes (for P > 2) by
repeatedly shifting by -1 in windows of size P. The algorithm is
given in Figure 4.


procedure ACCUM(A, I, M)
{each PE accumulates in A the I values of the next M PEs,
 including itself}
begin
  A[0] := I;
  for i := 1 to M-1 do
  begin
    SHIFT(I, -1, P);
    A[i] := I;
  end
end {ACCUM}


Figure 4: Data accumulation 
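
A minimal sequential sketch of the accumulation is given below. It works on logical positions (position g stands for PE gray(g)), so the gray code mapping itself is not modeled; the helper names `shift` and `accum` and the list representation are assumptions made for illustration.

```python
def shift(vals, t, W):
    """Model of SHIFT(vals, t, W) on logical positions: position qW + j
    receives the value previously held at position qW + (j - t) mod W."""
    P = len(vals)
    return [vals[(g // W) * W + ((g % W) - t) % W] for g in range(P)]

def accum(I, M):
    """Return A with A[j][i] = I[(j + i) mod P], as in procedure ACCUM.

    The I register is copied so that the caller's list is left unchanged.
    """
    P = len(I)
    cur = list(I)
    A = [[cur[j]] for j in range(P)]
    for _ in range(1, M):
        cur = shift(cur, -1, P)      # M-1 unit shifts in a window of size P
        for j in range(P):
            A[j].append(cur[j])
    return A
```

With `I = [0, 10, 20, 30]` and `M = 2`, position 3 ends up holding `[30, 0]`, i.e. its own value and that of the next position in the circular list.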


2.3.7. Adjacent Sum 


This operation is defined in [PRAS87]. For each PE p,
0 ≤ p < P, the sum

T(gray(p)) = \sum_{i=0}^{M-1} A[i](gray((p + i) mod P))

is to be computed.

As mentioned earlier, every hypercube of size P can be 
viewed as consisting of P/M subhypercubes (blocks) each of size M. 
For every PE p, some (or all) of the A's needed to compute
T(gray(p)) are in the block containing PE p. The remainder are in
the next block of PEs. The strategy to compute T is as follows:


1) Each PE, p, begins with two variables S and T (initially 0). 
These values circulate through the M PEs in the block. T 
accumulates the A values in the block needed in the sum for 
T(gray(p)). S accumulates the A values needed for 
T(gray((p —M) mod P)). 


2) The S values are shifted clockwise by M positions and added 
to the T values. 


The formal algorithm is given in Figure 5. The number of
unit routes is 2M + 4 (recall that M is a power of 2 and a power of 2
shift takes at most 4 unit routes). This can be reduced to M + 4 by
shifting S and T as a single packet.


procedure AdjacentSum(A, M)
begin
  S := 0; T := 0;
  for i := 0 to M-1 do
  begin
    T(p) := T(p) + A[i](p); (igray(p) mod M ≥ i)
    S(p) := S(p) + A[i](p); (igray(p) mod M < i)
    SHIFT(T, 1, M);
    SHIFT(S, 1, M);
  end
  SHIFT(S, -M, P);
  T := T + S;
end {of AdjacentSum}


Figure 5: Adjacent Sum 
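
The two-phase strategy can be exercised with a short sequential simulation. This is a sketch under assumptions: positions are logical (position g stands for PE gray(g)), the partial sum T is taken when the block-relative index is at least i and S otherwise, and the names `shift` and `adjacent_sum` are illustrative.

```python
def shift(vals, t, W):
    """Position qW + j receives the value from position qW + (j - t) mod W."""
    P = len(vals)
    return [vals[(g // W) * W + ((g % W) - t) % W] for g in range(P)]

def adjacent_sum(A, M):
    """A[g][i] is element i of the array held at logical position g.

    Returns T with T[g] = sum over i of A[(g + i) mod P][i].
    """
    P = len(A)
    S = [0] * P
    T = [0] * P
    for i in range(M):
        for g in range(P):
            if g % M >= i:
                T[g] += A[g][i]      # term for a PE in the same block
            else:
                S[g] += A[g][i]      # term for a PE in the previous block
        T = shift(T, 1, M)
        S = shift(S, 1, M)
    S = shift(S, -M, P)              # hand the S sums to the previous block
    return [t + s for t, s in zip(T, S)]
```

Comparing the result with a direct evaluation of the defining sum on small inputs (for instance P = 8, M = 4) is a quick way to check the indexing.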


3. ONE DIMENSIONAL CONVOLUTION 


The inputs to the one dimensional convolution problem are
vectors I[0..N-1] and T[0..M-1]. The output is the vector C1D
where:

C1D[i] = \sum_{v=0}^{M-1} I[(i + v) mod N] * T[v],  0 ≤ i < N

We use the computation of C1D as a basic step in our algo- 
rithms to compute C2D. In this section, we develop algorithms for 
C1D. We consider two cases: 


(i) Each PE has O(M) memory 
(ii) Each PE has O(1) memory 


Our algorithms assume that there are P = N processors and
that the vector I is mapped onto the hypercube using the gray
code mapping (i.e., I[i] is on PE gray(i)). Further, we assume that
there are N/M copies of T in the hypercube with one copy in each
block of M processors. Within a block, the mapping of T is the
same as that of I.


3.1. O(M) Memory 


When each processor has O(M) memory, the most effective
way to compute C1D is to first perform a data accumulation on I.
Following this, each processor has all the I values needed to
compute the corresponding entry of C1D. Next, the T values are
circulated through each block of M processors. During this circulation,
the T values are multiplied by the I values and the C1D values
computed. Procedure C1D_M (Figure 6) provides the details. Note
that while the final shift on T is not necessary for the computation
of C1D, our algorithms for C2D assume that T is unchanged by the
C1D algorithms. This final shift restores the original T values. The
number of unit routes taken is 2M.

procedure C1D_M
{O(M) memory algorithm for one dimensional convolution}
begin
  ACCUM(A, I, M);
  b := igray(p) mod M; {relative index of PE in M block}
  C1D := 0;
  for j := 1 to M do
  begin
    C1D := C1D + A[b] * T;
    b := (b+1) mod M;
    SHIFT(T, -1, M);
  end
end; {of C1D_M}
Figure 6: O(M) memory computation of C1D 
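
The following sketch simulates the O(M) memory procedure sequentially and can be compared against the defining sum. It is an illustration under assumptions: positions are logical (position g stands for PE gray(g)), the helper names are hypothetical, and the I register is copied rather than shifted in place.

```python
def shift(vals, t, W):
    """Position qW + j receives the value from position qW + (j - t) mod W."""
    P = len(vals)
    return [vals[(g // W) * W + ((g % W) - t) % W] for g in range(P)]

def accum(I, M):
    """A[g][i] = I[(g + i) mod P], obtained by M-1 unit shifts."""
    cur = list(I)
    A = [[v] for v in cur]
    for _ in range(1, M):
        cur = shift(cur, -1, len(I))
        for g, v in enumerate(cur):
            A[g].append(v)
    return A

def c1d_m(I, T, M):
    """Simulate C1D_M; returns (C1D, final T registers)."""
    N = len(I)
    A = accum(I, M)
    t_reg = [T[g % M] for g in range(N)]   # one copy of T per block of M
    b = [g % M for g in range(N)]          # relative index within the block
    c1d = [0] * N
    for _ in range(M):
        for g in range(N):
            c1d[g] += A[g][b[g]] * t_reg[g]
            b[g] = (b[g] + 1) % M
        t_reg = shift(t_reg, -1, M)        # M shifts in all restore T
    return c1d, t_reg
```

After the M shifts the T registers are back in their initial arrangement, which matches the remark that the final shift restores T.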


3.2. O(1) Memory 


When only O(1) memory per PE is available, we begin by
first pairing I values in the processors. The pair in processor p is
(A(p), B(p)) = (I[(jM + 2k) mod N], I[(jM + 2k + 1) mod N]) where
i = igray(p), j = ⌊i/M⌋, and k = i mod M. Figure 7 gives the
initial AB pairs in each PE for the case N = 16, M = 4. The
algorithm to obtain this is given in Figure 8.


 p   i=igray(p)   j   k   I        A, B
 0       0        0   0   I[0]     I[0],  I[1]
 1       1        0   1   I[1]     I[2],  I[3]
 3       2        0   2   I[2]     I[4],  I[5]
 2       3        0   3   I[3]     I[6],  I[7]
 6       4        1   0   I[4]     I[4],  I[5]
 7       5        1   1   I[5]     I[6],  I[7]
 5       6        1   2   I[6]     I[8],  I[9]
 4       7        1   3   I[7]     I[10], I[11]
12       8        2   0   I[8]     I[8],  I[9]
13       9        2   1   I[9]     I[10], I[11]
15      10        2   2   I[10]    I[12], I[13]
14      11        2   3   I[11]    I[14], I[15]
10      12        3   0   I[12]    I[12], I[13]
11      13        3   1   I[13]    I[14], I[15]
 9      14        3   2   I[14]    I[0],  I[1]
 8      15        3   3   I[15]    I[2],  I[3]

Figure 7: Initial AB pairs for N = 16, M = 4


procedure PAIRING(M)
{pairing I values in AB registers}
begin
  i := igray(p); {p is processor index}
  B := I;
  SHIFT(B, -1, P);
  A := I;
  for j := 1 to log M - 1 do
  begin
    C := B; SHIFT(B, -2^{j-1}, M);
    B := C; (i_{j-1} = 0)
    C := A; SHIFT(A, -2^{j-1}, M);
    A := C; (i_{j-1} = 0)
  end
  C := B; SHIFT(B, -(M/2), P);
  B := C; (i_{log M - 1} = 0)
  C := A; SHIFT(A, -(M/2), P);
  A := C; (i_{log M - 1} = 0)
end; {of PAIRING}


Figure 8: Pairing of the I’s 


The number of unit routes is at most 8 log M. This can be 
reduced to 4 logM by routing (A, B) pairs as single packets. 
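
One way to convince oneself that masked power of 2 shifts produce the required pairs is to simulate them. The sketch below rests on assumptions that the printed figure does not settle: bit subscripts are read as numbered from the least significant bit (bit 0), the step that shifts by 2^b tests bit b of i = igray(p), and positions are logical (position g stands for PE gray(g)). All function names are hypothetical.

```python
def shift(vals, t, W):
    """Position qW + j receives the value from position qW + (j - t) mod W."""
    P = len(vals)
    return [vals[(g // W) * W + ((g % W) - t) % W] for g in range(P)]

def masked_shift(X, s, W, bit):
    """Shift X by -s in windows of size W; positions whose index has
    `bit` equal to 0 keep their old value (the role of register C)."""
    Y = shift(X, -s, W)
    return [Y[g] if (g >> bit) & 1 else X[g] for g in range(len(X))]

def pairing(I, M):
    """Return (A, B) with A[i] = I[(jM + 2k) mod N] and
    B[i] = I[(jM + 2k + 1) mod N], where j = i // M and k = i mod M."""
    P = len(I)
    log_m = M.bit_length() - 1
    A = list(I)
    B = shift(list(I), -1, P)
    for j in range(1, log_m):
        B = masked_shift(B, 2 ** (j - 1), M, j - 1)
        A = masked_shift(A, 2 ** (j - 1), M, j - 1)
    B = masked_shift(B, M // 2, P, log_m - 1)   # last step uses window P
    A = masked_shift(A, M // 2, P, log_m - 1)
    return A, B
```

Under these assumptions the simulation reproduces the pairing formula for N = 16, M = 4 and for other power of 2 sizes.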


Once the AB pairing has been done, C1D may be computed
by rotating the AB values clockwise in a window of size P (in a
single rotation, B's move to A's in the same PE and A's move to
B's of the next PE) and rotating the T values clockwise in a
window of size M. Figure 9 shows the initial AB pairs and T values
for the case N = 16 and M = 4. Throughout the algorithm, the
product of A(p) and T(p) will give one of the terms needed to compute
C1D(igray(p)) for every PE p. B(p) will be the next I value needed.
Initially, this is true for all processors except those with
igray(p) mod M = M - 1. This situation is remedied by replacing
B with I in these processors to get the first row labeled AB'.
Following a rotation of AB, we get the second row labeled AB. Now,
the B value in processors with igray(p) mod M = M - 2 needs to
be changed to I(p). With this insight, one arrives at the algorithm
of Figure 10. Its correctness is easily established. The number of
unit routes (including those for pairing) is at most M + 8 log M.


4. TWO DIMENSIONAL CONVOLUTION 


Assume that P = N^2 PEs are available. These may be viewed
as an N × N array as described in Section 2. We use (i, j) to refer
to the PE in position (i, j) of the N × N array. Thus, for the
example of Figure 2, PE(0, 0) is PE 0, PE(2, 3) is PE 7, and PE(3,
3) is PE 6. The index of PE(i, j) is gray(iN + j) if i is even and
gray(iN + N - 1 - j) if i is odd. This corresponds to the snake
like row major interpretation. We assume that I[i, j] is initially
in the I register of PE(i, j). Further, since N and M are assumed to
be powers of 2, the N × N array may further be viewed as
composed of N^2/M^2 arrays of size M × M. We assume that T is
initially in the top left such array.


4.1. O(M) Memory 
When O(M) memory is available, PE(i, j), 0 ≤ i < N, 0 ≤ j < N,
computes M one dimensional convolutions S(q), 0 ≤ q < M, defined
as below:

S(q) = \sum_{r=0}^{M-1} I(i, (j + r) mod N) * T(q, r)

Next, C2D is obtained by performing an adjacent sum operation
along the columns of the N × N PE array. A high level
description of the algorithm is given in Figure 11. The total
number of unit routes is M^2 + O(M log M). The number of unit
routes for each of the steps of Figure 11 is log(N/M) + 4, M - 1,
M(M - 1) and 2M + 4 respectively.


4.2. O(1) Memory 


Now, it is not possible for each PE to accumulate the M
values of I it needs from its row. Nor is it possible for a PE to
compute the values S(q), 0 ≤ q < M. We may rewrite the definition of
C2D as

C2D[i, j] = \sum_{r=0}^{M-1} CXD[i, r, j]

where

CXD[i, r, j] = \sum_{a=0}^{M-1} I[(i + r) mod N, (j + a) mod N] * T[r, a]

Some of the CXD terms needed for the computation of
C2D(i, j) can be computed within the M × M PE block that
contains PE(i, j) as all the needed I and T values are in the block.
The remaining terms can be computed by the corresponding PE in
the next lower M × M block as this block contains the needed I
values. Thus each PE computes an E value (for itself) and an F
value (for the corresponding PE in the adjacent upper M × M
block).

The E and F values are computed in M iterations. During
iteration k, the PEs in the k'th row of each M × M PE block
compute their E and F values. These rows have index k, M+k, 2M+k,
.... Also,

E(aM + k, j) = \sum_{r=0}^{M-k-1} CXD[aM + k, r, j] and
F(aM + k, j) = \sum_{r=M-k}^{M-1} CXD[((a-1)M + k) mod N, r, j].

For this, we note that PE(i, j) is in the (i mod M)'th row of the
⌊i/M⌋'th M × M block. So, each PE needs to compute

A = CXD[⌊i/M⌋M + k, i mod M - k, j] if i mod M ≥ k and
B = CXD[⌊i/M⌋M + k - M, i mod M - k + M, j] if i mod M < k.
Figure 9: Execution Trace N = 16, M = 4


procedure C1D_1(M)
{O(1) memory one dimensional convolution}
begin
  PAIRING(M);
  C1D := 0;
  for j := 0 to M-1 do
  begin
    B(p) := I(p); (igray(p) mod M = M-1-j)
    C1D := C1D + A * T;
    SHIFT(A, -1, P);
    C := B; B := A; A := C; {interchange A and B}
    SHIFT(T, -1, M);
  end
end {of C1D_1}


Figure 10: O(1) memory computation of C1D 


procedure C2D_M(N, M)
{assumes O(M) memory per PE}

Step1: Broadcast T to all M × M blocks in the N × N PE array.

Step2: Perform a data accumulation on I. For this operation,
       the N × N PE array is viewed as N independent
       hypercubes with each row forming one such hypercube.
       Following the operation, each PE contains the M I
       values it needs to compute its S(q)'s.

Step3: Compute the S(q)'s. Each S(q) is a one dimensional
       convolution. However, the data accumulation step of
       the algorithm of Figure 6 may be omitted as the I
       values have already been accumulated in Step 2. To
       go from one S to another, the T values need to be
       circulated along the columns of each M × M block. This
       can be done using the data circulation algorithm of
       Section 2.

Step4: Compute C2D(i, j) = \sum_{r=0}^{M-1} S[r]((i + r) mod N, j).
       This is done using the adjacent sum algorithm of
       Section 2 on the columns of the N × N PE array.

end


Figure 11: High level description of two dimensional convolution 
with each PE having O(M) Memory 
Then, the PEs in rows aM + k, 0 ≤ a < N/M, can compute E and
F by summing the A's and B's in their column and in their M × M
block. Once this has been done, C2D is computed by shifting the
F's up the columns by M units and adding to the E's. A high level
description of the algorithm is provided in Figure 12. The number
of unit routes for Steps 1, 3, 4 and 5 of Figure 12 is
3 log(N/M) + 2, 2M + O(log M), log M + O(1) and 4 respectively.
The total number of unit routes is 2M^2 + O(M log M).


procedure C2D_1(N, M)
{assumes O(1) memory per PE}

Step1: Broadcast T to all M × M blocks in the N × N PE array.
Step2: Repeat Steps 3 and 4 for k := 0 to M-1.
Step3: PE(i, j) computes CXD[⌊i/M⌋M + k, i mod M - k, j]
       if i mod M ≥ k using C1D_1(M) and puts the result
       in A, otherwise A = 0;
       PE(i, j) computes
       CXD[(⌊i/M⌋M + k - M) mod N, i mod M - k + M, j]
       if i mod M < k using C1D_1(M) and puts the result in
       B, otherwise B = 0;
Step4: Use the data sum operation, described in Section 2, to
       sum the B's and A's in PE(⌊i/M⌋M + k, j) in F and E
       respectively. Shift the T values up the columns by 1.
Step5: SHIFT(F, -M, N) along columns. C2D := E + F.


Figure 12: High level description of two dimensional convolution
with each PE having O(1) Memory


5. KIRSCH MOTIVATED TEMPLATES 


Kirsch templates [BALL85] are commonly used in image
processing. Kirsch templates of size 1 (M = 3) and 2 (M = 5) are
shown in Figure 13.


By exploiting the special structure of these templates, tem- 
plate matching can be done more efficiently. A high level descrip- 
tion of the algorithm is given in Figure 14. Its complexity is O(M). 
The amount of memory required per PE is O(M). While efficient 
O(1) memory algorithms can also be developed, we shall not do 
this here as Kirsch templates usually have small M and it is rea- 
sonable to assume this much memory is available. 


Steps 3, 4, 5, 6 can be done efficiently by a simple adapta- 
tion of procedure AdjacentSum of Section 2. 


Figure 13: Kirsch templates of size 1 (M = 3) and 2 (M = 5)


ACCUM(A, I, M);
B[-1] := 0; C[-1] := 0;
for i := 0 to M-1 do
begin
  B[i] := A[i] + B[i-1];
  C[i] := A[M-1-i] + C[i-1]
end;
Do exactly one of the following steps depending on
the template type.
{Templates of types (a) and (e)}
C2D(i, j) = \sum_{a=0}^{M-1} (C[(M-3)/2] - B[(M-3)/2])((i + a) mod N, j)
{Templates of types (b) and (f)}
C2D(i, j) = \sum_{a=0}^{M-1} (C[M-2-a] - B[a-1])((i + a) mod N, j)
{Templates of types (c) and (g)}
C2D(i, j) = \sum_{a=0}^{(M-3)/2} C[M-1]((i + a) mod N, j)
          - \sum_{a=(M+1)/2}^{M-1} C[M-1]((i + a) mod N, j)
{Templates of types (d) and (h)}
C2D(i, j) = \sum_{a=0}^{M-1} (C[a-1] - B[M-2-a])((i + a) mod N, j)
Figure 14: Algorithm for Kirsch templates of Figure 13 


6. MEDIUM GRAIN TEMPLATE MATCHING 


In the previous sections we have developed algorithms to
perform template matching on a fine grain hypercube. Such a
computer has the property that the cost of interprocessor
communication is comparable to that of a basic arithmetic
operation. In this section, we shall consider the template matching
problem on a hypercube in which interprocessor communication is
relatively expensive and the number of processors is small relative
to the image size n. In particular we shall experiment with an
NCUBE/7 hypercube which is capable of having up to 128
processors. The NCUBE/7 available to us, however, has only 64
processors. The time to perform a two byte integer addition on each
hypercube processor is 4.3 microseconds whereas the time to
communicate b bytes to a neighbor processor is approximately
447 + 2.4b microseconds.


Several cases of the template matching problem can be
studied. These vary in the initial location of the image and the
template and the final location of the convolution (result matrix).
We consider the following cases. In all of these, the template is
initially in the host.


1. Host-to-host: The image is in the host initially and the result 
is to be left in the host also. 


2. Hypercube-to-host: The image is initially in the host but the 
result is left in the hypercube. 


3. Hypercube-to-hypercube: The image is initially in the hyper- 
cube and the convolution is to be left there too. 


Let p be the number of hypercube processors. We assume
that p is a perfect square and that √p divides n. Hence, the
hypercube may be visualized as a √p × √p mesh and the n × n
convolution matrix can be mapped onto this with each processor
getting an n/√p × n/√p block. We assume that each processor
has enough memory to hold one copy of the m × m template. As
far as mapping the n × n image is concerned, we consider the two
possibilities:


(1) Overlap Mapping: In this, each processor gets enough of the
    image to compute all its convolution values. Hence, the
    processor in position (0, 0) of the mesh gets
    I[0..n/√p + m - 2, 0..n/√p + m - 2].

(2) Nonoverlap Mapping: The image is decomposed into n/√p
    × n/√p blocks. This is done in the same way as the
    convolution decomposition. Each processor gets the image block
    that corresponds to its convolution block.


Notice that if overlap mapping is used, then the host must 
transfer more data to each hypercube processor than when the 
nonoverlap mapping is used. However, no interprocessor communi- 
cation is needed when the overlap mapping is used. Interprocessor 
communication is, however, needed when the nonoverlap mapping 
is used. This can take the form of each processor communicating to 
its north, east, and northeast neighbor processors the image values 
they need to compute their convolution. Alternatively, each proces- 
sor can compute the partial convolution values for its north, 
northeast, and east neighbors and then communicate these values. 
In either case, the communication overhead is the same. In our pro- 
grams, we adopt the latter strategy. 


It is also important to note that the communication overhead
in the template matching problem is small relative to the
computing cost. When the overlap mapping is used, O(nm√p + pm^2)
additional data is transmitted from the host to the hypercube
nodes (i.e., in addition to the transfer of n^2 image values). However,
since the host can send data to several nodes in parallel, the
overhead penalty is not as severe. While the same amount of data has
to be transferred between processors when the nonoverlap mapping
is used, the p processors can work in parallel so that the transfer
time is approximately that for the transfer of O(nm/√p + m^2)
data. In either case, this overhead is expected to be small
compared to the time required for the O(n^2 m^2/p) computing to be
done by each node.


In each of the three cases listed above, we have assumed that 
the host broadcasts the template to the hypercube processors using 
a tree expansion scheme. 


The NCUBE/7 run times for p = 1, 4, 16, and 64;
n = 32, 64, 128, 256, and 512; and m = 4, 8, 16, and 32 for the
overlap mapping are given in Figures 15 through 17. For
smaller values of p, the template matching can be done only for
small n as there isn't enough memory on a hypercube processor to
hold the convolution and the image subblocks assigned to it. The
figures show that for the case n = 512, m = 32, and p = 64, the
run times for the host-to-host case are approximately 2.6% higher
than those for the hypercube-to-host case and approximately 13.0%
higher than those for the hypercube-to-hypercube case. This reflects
the cost of transmitting the image and the convolution between the
host and the hypercube. The observed speedup is almost equal to the
theoretical maximum of p. The speedup and efficiency (speedup/p)
for n = 64 and m = 8 are shown in Figure 18.


Times are in seconds 
m = template size 
n =image size 
p =number of processors 
Figure 15: Overlap Mapping: Host-to-Host 


Times are in seconds 
Figure 16: Overlap Mapping: Hypercube-to-Host 


Times are in seconds 
Figure 17: Overlap Mapping: Hypercube-to-Hypercube 
                                  p = 1   p = 4   p = 16   p = 64
Host-to-host         Speedup      1.00    3.96    11.57     9.12
                     Efficiency   1.00    0.99     0.72     0.14
Hypercube-to-host    Speedup      1.00    3.82    11.43    14.89
                     Efficiency   1.00    0.95     0.71     0.23
Hypercube-to-        Speedup      1.00    3.99    15.82    59.23
hypercube            Efficiency   1.00    0.998    0.99     0.93

Times are in seconds 


Figure 18: Overlap Mapping: Speedup and Efficiency 
for n =64 and m =8 
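
Since efficiency is defined as speedup/p, the efficiency rows of Figure 18 can be recomputed from the speedup rows. The snippet below is a small consistency check, not part of the original experiments; it assumes the four columns correspond to p = 1, 4, 16, and 64, the processor counts named in the text.

```python
PROCESSORS = [1, 4, 16, 64]   # assumed column order of Figure 18

SPEEDUP = {
    "host-to-host": [1.00, 3.96, 11.57, 9.12],
    "hypercube-to-host": [1.00, 3.82, 11.43, 14.89],
    "hypercube-to-hypercube": [1.00, 3.99, 15.82, 59.23],
}

def efficiency(speedups, processors=PROCESSORS):
    """Efficiency = speedup / p for each processor count."""
    return [s / p for s, p in zip(speedups, processors)]

for case, speedups in SPEEDUP.items():
    print(case, [round(e, 3) for e in efficiency(speedups)])
```

The recomputed values agree with the tabulated efficiencies to within rounding, and they make the contrast plain: only the hypercube-to-hypercube case stays above 0.9 at p = 64.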


The run times for the nonoverlap mapping are presented only 
for the hypercube-to-hypercube case. In this case, there are two 
possibilities: 


1. Overlap of computation and communication between nodes
2. No overlap of computation and communication between
   nodes


Our experiments indicate that there is no substantial 
difference in the run times in the above two cases. This is because 
the amount of computation is much larger than the amount of 
communication between nodes. The run times for the nonoverlap 
mapping are given in Figure 19. For small template sizes the 
nonoverlap method is significantly slower than the overlap method. 
For larger template sizes the difference in run time is not so 
significant. Much of the difference in the run time is attributable to 
the following observations: 


1. The program for the nonoverlap case is considerably more 
complex and so has greater overhead than that for the over- 
lap case. 

2. The data transfer rate from the host to the nodes is much
   higher than that between nodes.


Figure 20 shows the times required by a CRAY-2 supercomputer
to perform template matching. These are approximately one
fifth of the hypercube-to-hypercube times on the NCUBE/7 with
64 processors.
Times are in seconds
Figure 19: Nonoverlap Mapping: Hypercube-to-Hypercube 
Times are in seconds
Figure 20: Template Matching on CRAY-2 


7. CONCLUSIONS 


In this paper, we have presented optimal algorithms for 1-D 
convolution and image template matching (2-D Convolution) on an 
MIMD hypercube multicomputer. In addition, efficient algorithms 
for Kirsch templates were developed. Also, we have experimented 
with a 64 processor NCUBE hypercube and found that this com- 
puter can perform template matchings for large images and tem- 
plates in about five times the time needed by the CRAY-2 super- 
computer. Thus, the NCUBE has a very good cost-performance 
ratio for this problem. 


COMPUTATIONAL GEOMETRY ON A HYPERCUBE

Abstract. This research focuses on implementing
algorithms to solve basic geometric problems on hypercube
computers, which have recently been introduced on the
commercial market. Two solutions for the planar convex hull
problem are presented, both in O(log^2 n) time, which is the
best one can expect with existing sorting algorithms. An
O(log* n) Voronoi diagram algorithm is given. O(log n)
solutions for detecting and finding the intersection of two
convex polygons, computing the minimal distance between two
convex polygons and finding critical support lines of two
convex polygons, and O(log^2 n) solutions to the diameter,
smallest enclosing box, width, minimax linear fit, vector
sum of two convex polygons, ECDF searching, 2-D and 3-D
maximal elements, 2-set dominance counting and closest
points problems are described. Several data communication
techniques used to solve geometric problems are also
presented.


Introduction 


A d-dimensional hypercube computer consists of n =
2^d synchronized processing elements (or nodes), linked
together in a d-dimensional binary cube network. Each
node has an associated constant size memory. Each
node m is given a unique d-bit identification number
(m_{d-1}, ..., m_1, m_0) (henceforth referred to as the node i.d.).
In the lexicographic order of nodes, a node and its i.d.
number are related by m = 2^{d-1} m_{d-1} + ... + 2 m_1 + m_0. Two
nodes in a hypercube are said to be neighboring if they
share a communication link, i.e. iff their corresponding
i.d.'s differ in exactly one bit position. The notation ⊕_k(m)
will be used to denote the node m with the k-th bit flipped (for
example, ⊕_3(01001) = 00001). The neighbors of a node m
are exactly ⊕_0(m), ⊕_1(m), ..., ⊕_{d-1}(m). The
communication diameter of hypercube networks is logarithmic.
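
The bit-flip notation maps directly onto an exclusive-or with a single-bit mask. The helper names below are illustrative and not taken from the paper.

```python
def flip(m, k):
    """Node obtained from node m by flipping bit k of its i.d."""
    return m ^ (1 << k)

def neighbors(m, d):
    """The d neighbors of node m in a d-dimensional hypercube."""
    return [flip(m, k) for k in range(d)]

# The example from the text: flipping bit 3 of 01001 gives 00001.
print(format(flip(0b01001, 3), "05b"))
```

Each node therefore has exactly d neighbors, and any two nodes are at most d = log n links apart, which is the logarithmic diameter mentioned above.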

We use a model of hypercubes in which communica- 
tion time is assumed to predominate. We ignore the time 
for start-up and termination and the transfer rate when 
sending messages. Thus, we assume that, in unit time, 
each processor may send at most one message to one of 
its neighbors or perform at most one operation (processor- 
bound model). Using the model described, we solve some 
geometric problems, assuming that we are given one ele- 
ment (point, edge,...) per processor (we suppose that in- 
put/output procedures are done in constant time via a 
processor with large memory connected with all other pro- 
cessors). The next section describes data communication 
techniques used in our solutions. 


Data communication on hypercubes 


Broadcasting. One node has to send the same message
to all the other nodes in the hypercube. An O(log n)
solution is presented in [20].

Parallel prefix. Given an array b_0, ..., b_{n-1}, one
element per processor, compute b_0 * b_1 * ... * b_i for
1 ≤ i ≤ n-1, where '*' is an arbitrary binary associative
operation. We implement the standard parallel prefix algorithm
(cf. [7]) on a hypercube, to run in O(log n) time. Each node m
of the hypercube stores three data: t, r and c, where t is
initially equal to b_m, and finally to b_0 * b_1 * ... * b_m. We
use 'a ←_i b' ('b →_i a') to denote sending data b from
⊕_i(m) (m, resp.) to node m (⊕_i(m), resp.) which receives it
and stores it as data a. ':=_i' is used for assignments made in
node ⊕_i(m). The algorithm runs for each node x in parallel.

FOR i = 0 TO d-1 DO
  IF x + 1 ≡ 0 (mod 2^{i+1}) THEN
    BEGIN r ←_i t; t := r * t END;

FOR i = d-2 DOWNTO 0 DO
  IF x + 1 ≡ 0 (mod 2^{i+1}) and x ≠ 2^{i+1} - 1 THEN
    BEGIN c ←_i t; r →_i r; t :=_i r * t; r := r * c END
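
The two loops can be traced with a sequential simulation. The sketch below follows one reading of the printed code, which is partly illegible in places: in the first loop a node whose low i+1 bits are all ones receives t from its neighbor across bit i, and in the second loop such a node hands its r to that neighbor, which multiplies it into t. The function name and the list-based registers are assumptions.

```python
def hypercube_prefix(b, op):
    """Return [b0, b0*b1, ..., b0*...*b(n-1)] for an associative op.

    n = len(b) must be a power of 2; index x plays the role of node x.
    """
    n = len(b)
    d = n.bit_length() - 1
    t = list(b)
    r = [None] * n
    # First loop: combine blocks upward; node n-1 ends with the total.
    for i in range(d):
        new_t = t[:]
        for x in range(n):
            if (x + 1) % (1 << (i + 1)) == 0:
                r[x] = t[x ^ (1 << i)]
                new_t[x] = op(r[x], t[x])
        t = new_t
    # Second loop: push completed prefixes back down.
    for i in range(d - 2, -1, -1):
        new_t, new_r = t[:], r[:]
        for x in range(n):
            if (x + 1) % (1 << (i + 1)) == 0 and x != (1 << (i + 1)) - 1:
                y = x ^ (1 << i)
                c = t[y]                   # c <- t of the neighbor
                new_r[y] = r[x]            # neighbor stores our r
                new_t[y] = op(r[x], t[y])  # neighbor: t := r * t
                new_r[x] = op(r[x], c)     # here:     r := r * c
        t, r = new_t, new_r
    return t
```

Because the operands are always combined left to right, the sketch also works for associative operations that are not commutative, such as string concatenation.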

Maximum. For '*' being max, the first step of our
algorithm will report (in node n-1) the maximal element.

Ranking. Some nodes are selected. The rank of a node
is the number of selected nodes with a smaller index. An
O(log n) ranking algorithm has been presented in [15]. Our
parallel prefix algorithm also solves the ranking problem
(for '*' being '+' and b_m being 0 or 1) with fewer
data movement operations than in [15].

Sorting. Given an element per processor, the sorting
can be done in O(log^2 n) time [4,22]. After sorting, the
elements are kept in nodes in the lexicographic order.

Merging. Given two sorted arrays A and B, each
stored in a hypercube of size n/2, their merging can be
done in O(log n) time [22]. We present an iterative and
simple code of the merging procedure from [22].

FOR i = 0 TO d-2 DO IF x < n/2 THEN (x → ⊕_i(x))
FOR i = d-1 DOWNTO 0 DO
  IF x_i = 0 THEN order(x, ⊕_i(x))

If the data in x is less than the data in ⊕_i(x) then
x and ⊕_i(x) will exchange data as the effect of function
order(x, ⊕_i(x)). The symbol '→' is used for passing data
from one node to the other.
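
The merging code is a reversal of the lower half followed by a bitonic merge, and it can be simulated sequentially as follows. This is a sketch under an assumption about the comparison direction: here order leaves the smaller item in the node whose bit i is 0, so the result is ascending in lexicographic node order. The function name is illustrative.

```python
def hypercube_merge(data):
    """data[0 : n/2] and data[n/2 : n] are each sorted ascending;
    n = len(data) is a power of 2. Returns the merged list."""
    n = len(data)
    d = n.bit_length() - 1
    vals = list(data)
    # Step 1: nodes x < n/2 pass data across dimensions 0..d-2,
    # which reverses the lower half and makes the sequence bitonic.
    for i in range(d - 1):
        vals = [vals[x ^ (1 << i)] if x < n // 2 else vals[x]
                for x in range(n)]
    # Step 2: compare-exchange across dimensions d-1 down to 0.
    for i in range(d - 1, -1, -1):
        for x in range(n):
            if not (x >> i) & 1:
                y = x ^ (1 << i)
                if vals[x] > vals[y]:
                    vals[x], vals[y] = vals[y], vals[x]
    return vals
```

For instance, `hypercube_merge([1, 4, 6, 9, 2, 3, 7, 8])` yields `[1, 2, 3, 4, 6, 7, 8, 9]` after 2 exchange steps and 3 compare-exchange steps.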

The cousins of a ∈ A in B are two consecutive
elements in B so that a is between them in the sorted list A ∪ B.
The cousins in B of each element in A can be determined
in O(log n) time on a hypercube by merging and interval
broadcasting operations. The ranks of the two cousins b1
and b2 from B for an element a ∈ A are determined by
r(b1, B) = r(a, A ∪ B) - r(a, A) and r(b2, B) = r(b1, B) - 1,
where r(e, X) denotes the rank of an element e in the sorted
set X (the rank of the first element being 0).
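
The rank identities can be illustrated sequentially: the number of elements of B that precede a in the merged list is r(a, A ∪ B) - r(a, A), which is exactly the rank in B of the first element of B after a. The sketch below assumes all elements are distinct; the function name and the use of None for a missing cousin are illustrative choices.

```python
def cousins(a, A, B):
    """Return (b2, b1): the elements of sorted list B just below and
    just above a, where a is an element of sorted list A."""
    merged = sorted(A + B)
    rank_b1 = merged.index(a) - A.index(a)   # r(a, A u B) - r(a, A)
    rank_b2 = rank_b1 - 1
    b1 = B[rank_b1] if rank_b1 < len(B) else None
    b2 = B[rank_b2] if rank_b2 >= 0 else None
    return b2, b1
```

With A = [2, 5, 9] and B = [1, 4, 6, 10], the element 5 has rank 3 in the merged list and rank 1 in A, so its cousins have ranks 2 and 1 in B, namely 6 and 4.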

Reversing. The effect of the first step in the merging
procedure is to reverse data in nodes 0, ..., n/2-1. It means
that a list of data can be reversed in O(log n) time.

Distribution. We assume that some nodes m of the
hypercube store a record r_m and a node destination
address h_m such that if i < j then h_i < h_j. The distribution
operation consists of routing, for each m, the record r_m to
the node h_m. It can be performed in O(log n) time [15].

Translation. Node x has to send a message to the
node x + s (mod n) concurrently for several nodes x. This
can be done in O(log n) time by two distributions (one for
nodes with x + s < n and one for the remaining nodes).
For s = 1 it gives each node access to the data in
its successor in the lexicographic order.

Compression. Some nodes of the hypercube contain
"active" elements while others do not. Compress the active
elements, i.e. store them in nodes 0, 1, 2, ..., s-1 where s is
the number of active elements. After ranking the active
elements, compression becomes an inverse distribution operation
and a solution is presented in [15].

Unmerging. Given a sorted list of elements so that half
of the elements belong to a set A (thus the remaining belong
to Ā, the complement of A) and each element knows the
corresponding rank in A or Ā, permute the list to return
each of A and Ā to a hypercube of size n/2. The problem
can be solved by running the merging algorithm in reverse
order, or by two compressions and a translation.

Interval broadcasting. Certain of the nodes 0, 1, ..., n-1
are leaders; they possess data that they must share with
all the higher numbered nodes, up to but not including the
next leader (the interval of nodes between two leaders).
Interval broadcasting can be done in O(log n) time on a
hypercube ([22, Theorem 6.9] and [15, Theorem 1]).

Many-one routing. Both origin and destination
nodes have keys, with the keys of origin nodes being different
from each other. Each destination node should receive
data from the origin with the same key. The problem can
be solved in O(log^2 n) time on a hypercube [22], by
applying sorting and interval broadcasting techniques.

Pairing elements. Given two sets A and B, each
containing √n data distributed one per node of a hypercube of
size n, broadcast these data in such a way that each node
of the hypercube contains exactly one pair of data (taken
one from each of A and B) and all pairs are distributed. First
we compress data from A. These data will be stored in
a sub-hypercube A' having nodes (0, ..., 0, x_{d/2-1}, ..., x_0).
Also we compress data from B and translate them to
the sub-hypercube B' having nodes (x_{d-1}, ..., x_{d/2}, 0, ..., 0).
Now, we broadcast data from each node of A' and B'
to all nodes of a hypercube of size √n. As a result,
each node (x_{d-1}, ..., x_0) of the hypercube receives a pair of
data by broadcasting from nodes (0, ..., 0, x_{d/2-1}, ..., x_0)
and (x_{d-1}, ..., x_{d/2}, 0, ..., 0).


Planar convex hull algorithms 


We present two O(log^2 n) solutions for the planar
convex hull problem on a hypercube model of computation.
One uses the merging slopes technique (independently used
in [10,14,18] for solving several problems on mesh
computers and in [18] for solving all problems mentioned in the
section on CREW PRAM; the corresponding sequential
technique is presented in [21,5]) while the other is based
on the CREW PRAM algorithm of [2].

Divide-and-conquer is a common strategy to find
the convex hull H(S) of a set of points S sorted by
x-coordinate: Partition the points of S into two separated
sets P and Q of half the size, each stored in a hypercube of
size n/2, recursively compute H(P) and H(Q), and merge
H(P) and H(Q) to form H(S) by computing common
tangents of H(P) and H(Q). The two proposed solutions differ
in the way they merge H(P) and H(Q).


Merging slopes technique. The a-distance of a point to an 
oriented edge p is its distance to the edge p' obtained 
by rotating p by the angle a (with distances of points to 
the left (right) of p' being positive (negative, resp.)). 

Let A and B be two convex polygons in the plane, 
each containing O(n) edges given in counterclockwise 
order. Given an angle a, consider the following problem 
(we call it the extremal search problem ES(A, B, a)): for 
each edge p ∈ A find a vertex P ∈ B with the smallest 
a-distance to p among vertices from B (P is the associated 
point of p in direction a). It is easy to see that for a = 0 
(a = π) P is the vertex with the smallest (greatest, resp.) 
distance from p among vertices of B. For a = π/2 (a = 3π/2) 
P is the easternmost (westernmost, resp.) point of p. 

To describe the procedure ES(A, B, a), we first increase 
the slopes of edges of A by a. The edges with minimal 
slopes in A and B are recognized and by some translations 
they are moved to the first nodes of the corresponding 
hypercubes. Since slopes of edges of both polygons are then 
given in increasing order, the sets A and B can be merged 
(by their slopes) in O(log n) time. Now the sets A, B and 
A ∪ B are sorted and each edge e of A can find its cousins 
in B, the common element of which is the associated point 
of e. We use the unmerge technique to return all edges to 
their initial positions. 

In order to merge H(P) and H(Q), we decide for each 
of their edges whether it is an external or internal edge, 
i.e. whether it is a convex hull edge of H(S) as well. To 
judge if an edge is external, we need to test if H(P) and 
H(Q) are in the same half-plane bounded by the edge. 
However, instead of testing all the vertices of H(Q) with 
an edge e of H(P), we only test two representatives 
(associated points of e) such that if they are in the same 
half-plane bounded by e as H(P), every point in H(Q) is. 


These two representatives for e in H(P) (H(Q)) are the 
nearest and furthest extreme points from H(Q) (H(P), resp.) 
and are obtained by calling procedures ES(H(P), H(Q), 0), 
ES(H(P), H(Q), π), ES(H(Q), H(P), 0) and ES(H(Q), H(P), π). 
Now each edge can decide in constant time if it is external 
or not. Then each extreme point of H(P) or H(Q) can learn 
if it is an extreme point of H(S) (translation by 1 can be 
used to find the necessary data). Two of them in both H(P) 
and H(Q) share an external and an internal edge. These four 
points determine the two common tangents of H(P) and H(Q). 
Then the computation of the circular edge list of H(S) can 
be done in O(log n) time by some translations. 

The time complexity of all procedures in the merge step 
is O(log n). Because there are O(log n) levels of recursion, 
the overall time complexity of the presented algorithm is 
O(log^2 n). 
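As a hedged aside, the bound can be read off the recurrence implied by the divide-and-conquer scheme. The constant c and the base case are assumptions introduced only for this illustration.

```latex
T(n) = T(n/2) + c\log_2 n, \qquad T(1) = O(1)
\;\Longrightarrow\;
T(n) = c\sum_{k=1}^{\log_2 n} k + O(1)
     = \frac{c}{2}\,\log_2 n\,(\log_2 n + 1) + O(1)
     = O(\log^2 n).
```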

Using the merging slopes technique the diameter, 
smallest enclosing box, width and minimax linear fit of 
a set of n points and vector sum and critical support lines 
of two convex polygons (see [21,5,16,18,19] for definitions) 
can be found on a hypercube. The details of these solutions 
are presented in a full version of the article [19]. 

Another convex hull algorithm on a hypercube can be 
derived from the CREW PRAM solution of Atallah and 
Goodrich [2]. The main point in the algorithm [2] is to 
divide P and Q into √n equal portions by considering √n 
vertices and, by examining each pair of considered vertices, 
to find the common tangent of the polygons of size √n 
obtained in this way. Then one of the polygons P or Q can 
be reduced to size √n and, in one more iteration, the common 
tangent of P and Q will be constructed [2]. The pairing 
elements and the broadcasting (to construct the tangent 
passing through a point) techniques are applied. A similar 
approach was used in [6] to solve the convex hull problem in 
O(log^2 n) time on CREW PRAM and cube-connected cycles 
models. The latter one is directly implementable on a 
hypercube in O(log^2 n) time. 

The problem of computing the minimal distance between 
two convex polygons involves similar techniques, and is 
solved in [2] for the CREW PRAM model of computation. 
Using pairing elements and other techniques, an O(log n) 
hypercube solution can be obtained. 


Planar point location and Voronoi diagram 


In order to locate O(n) points in the planar subdivision 
defined by O(n) edges we use the chain method 
described by Lee and Preparata [11], a parallelization of 
which for mesh-connected computers is given in [12]. We 
slightly modify both methods in order to get an O(log^2 n) 
planar point location algorithm on a hypercube. 

First we sort regions by the x-coordinate of selected 
interior points (called centers). Then a monotone complete 
set of chains is defined as in [11,16,12]. These chains are 
nodes of a balanced binary tree the leaves of which 
correspond to regions of the subdivision. Each chain has its 
level and index (the rank of the chain among the chains of 
the given level). Chains may share common edges. If an edge 
e belongs to more than one chain, it belongs to all members 
of a set (an interval) of consecutive chains. We assign e to 
the hierarchically highest chain to which e belongs. The 
level and index of the chain are determined in constant time 
by the rule described in [12]. Now we sort all edges by 
their level as the primary key, their index as the secondary 
key and the y-coordinate of an endpoint of the edge as the 
tertiary key (the endpoint with the smaller y-coordinate 
among the two endpoints of an edge is chosen). Also, we sort 
all query points by their y-coordinates. Initially, all 
query points are assigned the highest level [log n] and 
index 0. Then, for each level i, from i = [log n] to i = 0, 
do the following: 

(i) Merge the set of edges and query points (note that 
all query points have the same level, equal to i), 

(ii) Perform interval broadcasting to find, for each 
query point Z, the corresponding edge e the query point 
should be discriminated against. If the y-coordinate of Z 
is not between the y-coordinates of the endpoints of e, then 
Z has been discriminated at an earlier level. Depending on 
which side of e the query point Z is, Z calculates the index 
of the chain at the next level against which it should be 
discriminated, 

(iii) Unmerge edges and query points (using former 
indices of query points), 

(iv) Re-sort query points by new indices, by compressing 
query points with answer "left" of the corresponding edge 
in Step (ii) (query points with answer "right" will also be 
compressed) and (since both subsets of query points are 
sorted by new indices after compressing) merging "left" 
and "right" query points by their new indices. Give the next 
level to all query points. 

All query points will be located in Step (ii) when 
i = 0. Since all steps (i)-(iv) take O(log n) time, the time 
complexity of the planar point location algorithm is 
O(log^2 n). 

An O(log^3 n) algorithm to construct the Voronoi diagram 
of a set S of n planar points on a hypercube with n 
processors can be obtained by using the algorithm of Jeong 
and Lee [10], which solves the problem on mesh-connected 
computers, together with the planar point location technique 
and the presented data communication techniques. In [19] it 
is shown that all operations in the merge step of the 
algorithm can be implemented in O(log^2 n) time. 

Finding intersection of two convex polygons 


We give a parallel algorithm for finding the intersec- 
tion of two convex polygons P and Q with O(n) vertices 
all together by modifying the sequential method of [17]. 

By drawing a vertical line through each vertex of P 
and Q we divide P and Q into slabs. The leftmost and 
the rightmost vertices of P and Q (they can be found 
in O(log n) time) divide both P and Q into two chains: 
the upper and the lower chain of vertices (denote them 
u_P, l_P, u_Q and l_Q, respectively). The intersections A1, 
A2 and A3 of a vertical line passing through a vertex A of a 
chain with the remaining three chains can be obtained in 
parallel (one processor per vertex A) by computing the 
nearest points A' and A" of A to the left and to the right, 
respectively, in the considered chain and finding the 
intersection of A'A" with the vertical line through A. 
Clearly A' and A" are the cousins of A in the chain. Upper 
and lower chains of P and Q can be formed by some 
translations and reversions from P and Q. Then the desired 
intersections can be obtained by merging u_P ∪ u_Q, 
u_P ∪ l_Q, u_P ∪ l_P, l_P ∪ u_Q, l_P ∪ l_Q and u_Q ∪ l_Q, 
and unmerging between two steps. Let P ∪ Q denote the list 
of vertices of P and Q sorted together by x-coordinate (it 
can be constructed in O(log n) time by merging the upper and 
lower chains of P and Q). Consider each slab defined by two 
neighboring points A and B. On the basis of the coordinates 
of points A, A1, A2, A3, B, B1, B2 and B3 (translation by 1 
can be used to exchange the data) one can decide in O(1) 
time in parallel whether P and Q intersect within the slab 
and determine (at most two) points of P ∩ Q which are 
located in the slab (these are either intersections of edges 
of P and Q or vertices of P or Q which are located inside 
the other polygon). So far we have detected all vertices of 
the intersection of P and Q in O(log n) time. However, we 
should order them to obtain their convex hull. For each 
vertex of the intersection we decide whether it is a vertex 
of the upper or lower chain of P ∩ Q (this can be done in 
constant time). Also, we assign the vertex to the left 
vertex of the corresponding slab defined by points of P ∪ Q. 
Thus each vertex of P or Q will have assigned zero or one 
vertex of P ∩ Q from the upper convex hull chain (and 
similarly for the lower convex hull chain). Now the upper 
chain of P ∩ Q can be obtained by simply compressing points 
of P ∪ Q, assuming that the active points of P ∪ Q are 
those having an assigned vertex of P ∩ Q. Similarly we find 
the lower convex hull chain and finally order the vertices 
of P ∩ Q by some translations and reverse steps. 


This algorithm also solves the problem of detecting the 
intersection of two convex polygons in parallel (linear 
separability). Clearly they intersect if at least one vertex 
of P ∩ Q is found. The time complexity is still O(log n). 

The described algorithm can also be implemented in 
optimal time on a mesh computer and in O(log n) time on 
a CREW PRAM. 


ECDF searching problem 


Given a set S = {p_1, ..., p_n} of n points in 
2-dimensional space, a point p_i dominates a point p_j 
(p_i > p_j) iff p_i[k] > p_j[k] for k = 1, 2, where p[k] 
denotes the k-th coordinate of a point p. The 2-dimensional 
ECDF searching problem consists of computing for each p ∈ S 
the number D(p, S) of points of S dominated by p. 

Let the rank B(p,S) of a point p in the set S con- 
taining n points be the position of p in the set S sorted 
according to the y-coordinate of points, the rank of bot- 
tommost and uppermost points being 0 and n — 1, resp. 

As a preprocessing step of the ECDF searching algorithm, 
we sort points by x-coordinate. The rest of the algorithm 
is best described recursively. Suppose S is divided into 
two subsets L and R of equal size with l[1] < r[1] for 
all l ∈ L and r ∈ R, both sorted by y-coordinate. After 
the recursive calls for L and R in parallel we will 
have D(l, L), D(r, R), B(l, L) and B(r, R) for all l ∈ L 
and r ∈ R. The main point is that the number of points from 
L which are below r ∈ R is max{B(l, L) | l[2] < r[2]} = 
B(r, S) - B(r, R). Therefore the final result will be 
obtained directly from the relations: 

D(l, S) := D(l, L) for all l ∈ L, 

D(r, S) := D(r, R) + B(r, S) - B(r, R) for all r ∈ R. 


Unfolding the recursion yields the iterative solution. 
Initially B = D = P = 0 for each node x of the hypercube. 

FOR i = 0 TO d-1 DO 
BEGIN 
   MERGE consecutive blocks of size 2^i in pairs; 
   IF x_i = 1 THEN D := D + P - B; 
   B := P 
END 

In the merge procedure the values B, D and P are 
exchanged whenever data are exchanged between nodes. The 
ranks of elements after merging are denoted by P and are 
easily obtained as the relative node's i.d. in the 
corresponding block of size 2^{i+1}. 

The running time of the merging step is O(log n), which 
gives a total of O(log^2 n) time for the ECDF searching 
problem. 
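The recursive relations can be checked with a short sequential sketch in Python. It is an illustration under assumptions, not the hypercube procedure: all x-coordinates and all y-coordinates are taken to be distinct, and the rank difference B(r, S) - B(r, R) is obtained during an ordinary two-way merge by y-coordinate. The name ecdf_counts is hypothetical.

```python
def ecdf_counts(points):
    """Return {point: number of points it dominates}; assumes distinct x and distinct y."""
    def solve(block):
        # block is sorted by x; the result is a list of (point, D) sorted by y
        if len(block) <= 1:
            return [(p, 0) for p in block]
        mid = len(block) // 2
        left, right = solve(block[:mid]), solve(block[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) or j < len(right):
            take_left = j == len(right) or (
                i < len(left) and left[i][0][1] < right[j][0][1])
            if take_left:
                merged.append(left[i])              # D(l, S) = D(l, L)
                i += 1
            else:
                p, d = right[j]
                # len(merged) is the rank of p in S, j its rank in R;
                # the difference is the number of points of L below p.
                merged.append((p, d + len(merged) - j))
                j += 1
        return merged
    return dict(solve(sorted(points)))
```

For the points (1,1), (2,3), (3,2), (4,4) the sketch reports 0, 1, 1 and 3 dominated points, respectively.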

The same algorithm also solves the maximal elements 
problem, i.e. the problem of determining points which are 
dominated by no other point. We replace the sign > by < 
in the definition of domination and look for points p with 
D(p, S) = 0. Maximal elements can also be determined 
directly by sorting and a parallel prefix operation (with 
* = max), as suggested in [3] for the CREW PRAM model. 
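The direct method can be sketched sequentially as follows. The sketch assumes distinct x- and y-coordinates and uses Python's itertools.accumulate as a stand-in for the parallel prefix operation with * = max; the name maximal_points is hypothetical.

```python
from itertools import accumulate

def maximal_points(points):
    """Points dominated by no other point, assuming distinct x- and y-coordinates."""
    by_x_desc = sorted(points, key=lambda p: p[0], reverse=True)
    ys = [p[1] for p in by_x_desc]
    # prefix[i] = largest y among the i points with larger x (exclusive prefix maximum)
    prefix = [float("-inf")] + list(accumulate(ys, max))[:-1]
    return [p for p, best in zip(by_x_desc, prefix) if p[1] > best]
```

A point is reported exactly when no point with a larger x-coordinate also has a larger y-coordinate.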

The 2-set dominance counting problem (computing for 
each point from A the number of points from B dominated 
by the point) and the maximal element problem for point 
sets in three-dimensional space can be solved on a hypercube 
in O(log^2 n) time by a similar iterative algorithm, using 
labeled functions from [3]. 


Closest points problem 


An O(log^2 n) hypercube solution to the problem of 
computing two points with the smallest distance among 
n given points, based on a sequential method presented in 
[16], will be described. We again apply an iterative 
approach rather than a recursive one. 

First we sort the n points of the given set S by 
x-coordinate. At stage i (where i ranges from 0 to d-1), let 
L and R be the left and right halves of the points in a 
given block of size 2^{i+1}, respectively. Suppose L and R 
are both sorted by y-coordinate, and δ_1 (δ_2) (the smallest 
distance between points in L (R, resp.)) are found. Let the 
active elements of S be those with distance from a line 
separating L and R less than δ = min(δ_1, δ_2). Compress a 
copy of the active elements in both L and R and merge them 
to form the list S' of active elements. Each active element 
from L (R) should calculate its distance to a constant 
number of active elements in R (L) (at most six, as shown in 
[16]). By repeating the interval broadcasting technique a 
constant number of times (six) we inform each active element 
about the neighboring elements in the other set. Then the 
active elements choose the nearest element from the other 
set, and the minimum over the obtained distances is found 
and compared to δ. Now broadcast the new value of δ and 
merge L and R for the next stage. 


Abstract 

Given n points chosen uniformly and independently from the unit 
square, it is shown that a parallel random access machine (PRAM) 
with n processors can solve several geometric problems in constant 
expected time, achieving linear speedup. The PRAM is assumed to be 
synchronous, with concurrent read and “collision detecting” write, 
where if two or more processors write to the same memory location 
simultaneously then the memory value becomes “collision”. Problems 
solvable in constant expected time include determining for each point 
whether it is an extreme point of the convex hull, determining for each 
point if it is dominated by any other points, determining for each domi- 
nated point a maximal point that dominates it, finding the closest pair 
of points, and finding the furthest pair of points. These results extend 
to points chosen uniformly from the unit cube in d-dimensional space, 
and to many nonuniform distributions. 


1. Introduction 


It is well known that synchronous concurrent read, concurrent 
write parallel random access machines (CRCW PRAMs) are strictly 
more powerful than most other parallel computers. For example, an n 
processor CRCW PRAM can determine the minimum of n numbers in 
Θ(log log n) time [Val], while it is easy to show that on a PRAM with 
either an exclusive read (ER) or an exclusive write (EW), or on a 
distributed memory machine, at least Ω(log n) time is needed. However, 
Valiant's results do not really need the full power of concurrent writes, 
and we show that a weaker property, here called detecting write (DW), 
can solve many problems equally rapidly. DW is intermediate between 
CW and EW, in that if two or more processors write to the same 
memory location at the same time, then the value becomes "collision", no 
matter what values were being written. Valiant's approach can be 
utilized on a CRDW PRAM and still finish in only Θ(log log n) time. 

It is also well known that some geometric problems involving data 
known to be chosen from a uniform distribution can be solved faster, 
in the expected case, than the same problems for arbitrary data. For 
example, n points chosen uniformly and independently from the real 
interval [0,1] can be serially sorted in Θ(n) expected time, as opposed 
to the Ω(n log n) expected time for comparison-based sorts. Given n 
points chosen uniformly and independently from the unit square in 
2-space, the convex hull and nearest neighbor of each can be 
determined by a serial computer in Θ(n) expected time [BeSh, BWY], as 
opposed to the Ω(n log n) time needed for arbitrary planar data sets 
[Yao]. 

This paper shows that by combining the synchronous CRDW 
PRAM model with use of randomization in the algorithms and data 
sets generated randomly via uniform distributions, several geometric 
problems can be solved in constant expected time. Note that the best n 
processor CRCW PRAM algorithm known for sorting n points chosen 
uniformly from [0,1] takes Θ(log n) time, i.e., it is not known how to 
sort such data sets any faster than arbitrary data sets. Since 
determining the extreme points of the convex hull of a set of points in 
the plane requires as much time as sorting [Yao], this would seem to 
imply that it takes Ω(log n) time for a CRCW PRAM to determine which 
points are extreme points, even when the data is generated from a uniform 
distribution. However, the proof that determining extreme points is as 
hard as sorting holds only for a worst-case analysis, and we will show 
that the extreme points can be determined in constant expected time in a 
CRDW PRAM with a linear number of processors. To the best of our 
knowledge, these are the first constant expected-time algorithms for 
these problems on any parallel machine with only a linear number of 
processors. 

Because of length limitations, results will be given with only 
sketches of their proof. The algorithms involve doing preliminary 
work which, with high probability, reduces the data set to a small 
number of points remaining to be considered. For these points, the 
processor to point ratio is very high and significantly different tech- 
niques can be utilized. However, the remaining points must be first 
moved together into a small array so that all the processors can locate 
them and help in the processing. Usually the points would be packed 
into the initial positions of the array, but this would take more than 
constant time. Therefore the array is made larger than the expected 
number of points remaining, and the points are mapped to random 
locations. The array must be large enough so that with high probabil- 
ity no two points are mapped to the same location, but it must also be 
small enough so that the processor to array size ratio remains suffi- 
ciently high. 

The algorithms also have the property that if a step is reached 
where something undesired happens, then they resort to a standard 
worst-case polylogarithmic time CREW PRAM algorithm. Since this 
happens with very small probability, the expected time remains Θ(1). 
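The reasoning can be written out as a short calculation. Here c denotes the constant cost of the fast path, q the probability that some step fails, and F(n) the cost of the fallback; these symbols are not in the original and are introduced only for this sketch.

```latex
E[T(n)] \le c + q \cdot F(n), \qquad
F(n) = O(\log^{j} n) \text{ for some constant } j, \qquad
q = o\!\left(1/\log^{j} n\right)
\;\Longrightarrow\; E[T(n)] = \Theta(1).
```

Any polynomially small failure probability q satisfies the condition on q.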

Throughout, no attempt has been made to optimize constants. 


2. Results 


The phrase randomly chosen points will mean points chosen 
independently and uniformly on the unit square [0,1] x [0,1] in Euclidean 
2-space. Algorithms will also require that processors (PEs) generate 
pseudo-random integers in given intervals. It is assumed that these 
can be computed in constant time, and that they are uniformly and 
independently distributed on the interval. Distance will be measured 
with the Euclidean metric, though any other L_p-metric could be used. 
A point in a finite set S is an extreme point of S if it is one of the 
corners of the smallest convex polygon containing S. The point (x_1, y_1) 
dominates the point (x_2, y_2) if x_1 ≥ x_2 and y_1 ≥ y_2. A point is 
maximal in a set if it is not dominated by any other point in the set. 

The following lemma is used whenever many points have been 
eliminated from further consideration, and those remaining must be 
compressed into a small array so that processors can find them. If the 
expected number of points remaining is k, then the array will be at 
least of size k*. Each processor holding such a point must find a place 
in the array to put the point. 


2.0.1 Lemma On a CRDW PRAM of k processors, in Θ(1) time 
each can probably be allocated a unique position in an initialized 
integer array of k positions, with probability of failure o(k^{-0.5}). 

Sketch: Suppose each position of the integer array is initialized to -1. 
Each PE writes its ID (a unique positive integer) to a random array 
position, and then reads that position. If it reads its ID then that is its 
allocated position, while otherwise a conflict occurred. To determine if 
any processors experienced conflicts, PE 0 writes "false" to a boolean 
variable problems. Next, if any PE experienced a conflict it writes 
"true" to problems, while otherwise it pauses. Now all PEs read 
problems, and are finished if and only if it is false. Otherwise (i.e., 
if it is true or conflict), another round is repeated by those PEs without 
allocated positions, where now each such PE first reads the location it 
picked and does not write to it if it is already allocated. The process is 
repeated 3 times. One can show that the probability that some PE still 
has not been allocated a position is o(k^{-0.5}). 
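The rounds of the sketch can be simulated sequentially. The Python sketch below is illustrative only: the array size is left as a parameter (an assumption of this sketch), a cell that holds "collision" is treated as still free, and the shared flag problems is represented by checking whether any PE is still waiting. The names allocate and COLLISION are hypothetical.

```python
import random

COLLISION = "collision"   # value a cell takes when two or more PEs write it at once

def allocate(k, size, rounds=3, rng=random):
    """Simulate the allocation rounds; returns (array, set of PE ids still unallocated)."""
    array = [-1] * size
    waiting = set(range(1, k + 1))              # PE ids are positive integers
    for _ in range(rounds):
        writers = {}                            # slot -> ids written in this step
        for pe in waiting:
            slot = rng.randrange(size)
            if isinstance(array[slot], int) and array[slot] > 0:
                continue                        # slot already allocated: do not write
            writers.setdefault(slot, []).append(pe)
        for slot, ids in writers.items():       # detecting-write semantics
            array[slot] = ids[0] if len(ids) == 1 else COLLISION
        waiting -= {ids[0] for ids in writers.values() if len(ids) == 1}
        if not waiting:                         # the flag "problems" stayed false
            break
    return array, waiting
```

A call such as allocate(50, 2500) returns the array together with the set of PEs that failed to obtain a position within three rounds; the set is expected to be empty in most runs when the array is much larger than k.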


The proof of the lemma can be extended to show that for any a > 1 
and b > 0, there is a constant C(a,b) such that k processors can be 
allocated a position in an array of k^a positions in C(a,b) iterations, 
with probability of failure o(k^{-b}). 

2.1 Maximal Points and Extreme Points 

The following lemma is based on a modification of Valiant's 
observation that a synchronous CRDW PRAM of k^2 processors can determine 
the maximum of k values in constant (worst-case) time. The reason 
for including "Not a Point" as a value is because later algorithms will 
place a few points into a large array, and hence many entries will not 
correspond to points. 

2.1.1 Lemma In a CRDW PRAM of n processors, suppose each 
entry of a global array p[0..n^{1/3}-1] contains a point or the value NAP 
(Not A Point). Then in constant worst-case time the maximal points 
can be determined, and for each nonmaximal point a maximal 
dominating point can be determined. 

Proof: Let s = n^{1/3}, and assume that arrays maximal:[0..s-1] of boolean 
and dominator:[0..s-1] of point are initialized to true and NAP, 
respectively. When finished, maximal[i] is true if and only if p[i] is a 
maximal point, and if p[i] is a nonmaximal point then dominator[i] is one 
of its maximal dominators. Each PE executes the following algorithm, 
where i represents the index of the PE (0 ≤ i ≤ n-1), a local variable 
which is already initialized. Other local variables in each PE are i1, i2, 
and i3. A temporary, global boolean array T:[0..s-1, 0..s-1] is also 
used. Throughout, whenever conditional instructions occur where 
some PEs may take one branch and others take the other branch, it is 
implied that pauses are inserted so that all PEs complete each branch in 
the same time. 


1. Read p[i], and if it is NAP then write false to maximal[i]. 

2. Let i1 = i div s^2, i2 = (i div s) mod s, and i3 = i mod s. 
   {Notice that for each i1,i2,i3 triple with 0 ≤ i1,i2,i3 ≤ s-1, 
   there is exactly one PE with that triple.} 

3. If i3=0 then write true to T[i1,i2]. {At the end of step 5, 
   T[i1,i2] will still be true only if p[i2] dominates p[i1].} 

4. Read p[i1], p[i2], and p[i3]. If p[i1] or p[i2] are not points, 
   or if p[i2] does not dominate p[i1], then write false to 
   T[i1,i2] and go to 6. 

5. Otherwise, if p[i3] is a point and p[i3] dominates p[i1] and 
   i3<i2 then write false to T[i1,i2]. {This signals that, even 
   though p[i2] could be used to show that p[i1] is not maximal, 
   there is a dominating point of smaller index and p[i2] should 
   not be used. It doesn't matter whether T[i1,i2] ends up with 
   the value false or "collision".} 

6. If i3=0 then read T[i1,i2], and if it is true then write false to 
   maximal[i1]. 

7. {At this point, maximal is correctly determined for all 
   positions. Now for each nonmaximal point we locate the 
   maximal dominator of minimal index. These steps are 
   similar to steps 3-6.} 
   If i3=0 then write true to T[i1,i2]. 

8. Read maximal[i2] and maximal[i3]. If p[i1] or p[i2] is not a 
   point, or if p[i2] does not dominate p[i1], or if p[i2] is not 
   maximal, then write false to T[i1,i2] and go to 10. 

9. Otherwise, if p[i3] dominates p[i1], p[i3] is maximal, and 
   i3<i2 then write false to T[i1,i2]. 

10. If i3=0 then read T[i1,i2], and if it is true then write p[i2] to 
    dominator[i1]. 


Since each step takes constant time, the algorithm finishes in constant 
time. 
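Steps 1-6 can be traced with a small sequential simulation. The Python sketch below makes two assumptions that are not stated in the original: a point is not counted as dominating itself, and the points are distinct. It enumerates the s^3 virtual PEs explicitly and applies detecting-write semantics to each synchronous write step. The names maximal_flags, crdw_write, NAP and COLLISION are hypothetical.

```python
NAP = None                 # "Not A Point"
COLLISION = "collision"    # result of two or more simultaneous writes to one cell

def dominates(a, b):
    # Assumption of this sketch: a point does not dominate itself.
    return a != b and a[0] >= b[0] and a[1] >= b[1]

def crdw_write(cell_writes):
    """One synchronous write step: {address: [values written]} -> {address: stored value}."""
    return {addr: vals[0] if len(vals) == 1 else COLLISION
            for addr, vals in cell_writes.items()}

def maximal_flags(p):
    s = len(p)
    maximal = [pt is not NAP for pt in p]                            # step 1
    pes = [(i1, i2, i3) for i1 in range(s)
           for i2 in range(s) for i3 in range(s)]                    # step 2
    T = {(i1, i2): True for i1, i2, i3 in pes if i3 == 0}            # step 3
    step4, step5 = {}, {}
    for i1, i2, i3 in pes:
        if p[i1] is NAP or p[i2] is NAP or not dominates(p[i2], p[i1]):
            step4.setdefault((i1, i2), []).append(False)             # step 4
        elif p[i3] is not NAP and dominates(p[i3], p[i1]) and i3 < i2:
            step5.setdefault((i1, i2), []).append(False)             # step 5
    T.update(crdw_write(step4))
    T.update(crdw_write(step5))
    for (i1, i2), value in T.items():                                # step 6
        if value is True:
            maximal[i1] = False
    return maximal
```

For each nonmaximal p[i1], exactly one cell T[i1,i2] remains true (that of the dominator of smallest index), so the write to maximal[i1] in step 6 never collides.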


Figure 1 


2.1.2 Theorem On a CRDW PRAM with n processors, given a 
set of n randomly chosen points, in constant expected time it can be 
determined which points are maximal. Further, in constant expected 
time, for each dominated point one of the maximal points dominating it 
can be determined. 

Sketch: A sequence of steps is used to continually reduce the number 
of candidate extreme points for the next step. First it is determined if 
there are any points in the "corner" [1-n^{-0.45},1] x [1-n^{-0.45},1]. 
With probability close to 1 there are some, but not more than n^{0.11}. 
If there are points in the corner, then they are moved to an array of 
size n^{1/3}, and the maximal points are determined. In this set, the 
points with greatest x-coordinate and greatest y-coordinate are put in 
prespecified locations. Every PE i reads them and determines if the 
i-th point is dominated by them. If so the point is marked as not 
maximal, and dominator is set. 

For those remaining there are two groups: those in section A and 
those in section B of Figure 1. These are treated similarly, so only A 
will be discussed. Now it is determined if there are any points in A 
with x-coordinates in [1-n-95,1-n-9-45]. With high probability there 
are some, but not more than n^{0.11}. These are also moved to an array 
of size n^{1/3} and the maximal points determined. Then the point with 
the largest y-coordinate of this set is read by all PEs corresponding to 
points not yet dominated, and if they are dominated by this point they 
set maximal and dominator to appropriate values. This leaves a set A’ 
as in Figure 2. 

The process is repeated using regions with x-coordinates in the 
ranges [1-n995,1-n-5], [1-n-9-6,1-n- 955], ..., [0,1-n- 95]. With 
probability 1 - o(n^{-0.01}) each step is completed successfully. If any 
step does not complete successfully (i.e., either there are no points in 
the region, or else the PEs corresponding to the points cannot be allocated 
a position in the array of size n^{1/3} in constant time) then all PEs 
revert to using a deterministic CREW PRAM algorithm taking Θ(log n) 
worst-case time [AtGo]. The total expected time is Θ(1). 


A similar approach can be used to determine extreme points of the 
convex hull of the points, starting at the corners and working inwards. 
In case of failure at some step, CREW PRAM algorithms which finish 
in polylogarithmic worst-case time [ACGOY, AtGo, MiSt] are used. 


Figure 2 


2.1.3 Theorem On a CRDW PRAM with n processors, given a set 
of n randomly chosen points, in constant expected time it can be 
determined which points are extreme points. Further, in constant expected 
time, for each point which is not extreme, three extreme points which 
contain the point in the triangle they form can be determined (or two 
can be determined, if the point is on the boundary of the convex hull). 


Let E denote the number of extreme points. One can show that the 
expected value of E is Θ(log n) [ReSu], which implies that, for any 
integer k, with probability close to 1 the ratio n/E is Ω(E^k). Using 
this, in constant expected time one can apply algorithms which 
examine all possible combinations of k of the extreme points. This 
easily yields the following. 


2.1.4 Corollary On a CRDW PRAM with n PEs, given a set of n 
randomly chosen points, in constant expected time the maximal dis- 
tance between any pair of points can be determined, and enclosing rec- 
tangles and circles of minimal area can be determined. 


2.2 Closest Pair 


A significantly different problem is to determine the closest pair of 
points. For this problem there is no immediate technique to eliminate 
points, since even if n-1 points are known it is not possible to 
determine the closest pair without knowing the last point, and the closest 
pair might consist of the last point and any of the previously examined 
points. However, one can reduce the expected number of pairs for 
which the distance must be determined, by partitioning the unit square 
into subsquares of edgelength L. If it is known that at least one square 
has two points in it, then the answer is known to be no more than 
√2 L. In this situation, if a point is in square S in Figure 3, then it 
may be part of the closest pair only if there are points in S or the 20 
nearby squares. By choosing L to be n^{-0.99}, then with probability 
close to 1 there is a square with at least 2 points, no square has 4 or 
more points, and the number of squares with 2 or more points is 
o(n^{0.1}). 


Figure 3. The 20 neighbors with points within √2 L of S. 


Using this fact, first points are written to their squares. All points 
in squares with collisions, or which are in one of the 20 squares near a 
square with collisions, are written to a new array of size n^{0.2}. The 
closest pair within this new array is determined, and also the closest 
pair involving a square and one of its 20 nearby squares, where all 21 
have at most one point in them. The closest pair for the original set is 
the closer of these two pairs. As before, if any steps cannot be 
completed properly, then the PEs resort to using a polylogarithmic CREW 
PRAM algorithm [AtGo]. 
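The role of the 20 nearby squares can be illustrated by a sequential sketch. It departs from the procedure above in ways that should be read as assumptions of the sketch: every occupied square is compared with its 20 neighbours (rather than treating collision squares and singleton squares separately), and the fallback is a brute-force comparison of all pairs standing in for the CREW PRAM algorithm. The names closest_pair and NEAR and the parameter edge (the edgelength L) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor

# The 20 squares that can hold a point within sqrt(2)*L of a point in square (0, 0):
# the 5 x 5 block around it without the four corner squares and without itself.
NEAR = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
        if (dx, dy) != (0, 0) and not (abs(dx) == 2 and abs(dy) == 2)]

def closest_pair(points, edge):
    """Closest pair among at least two points, using squares of the given edge length."""
    grid = defaultdict(list)
    for pt in points:
        grid[(floor(pt[0] / edge), floor(pt[1] / edge))].append(pt)
    if all(len(cell) < 2 for cell in grid.values()):
        # No square holds two points, so the sqrt(2)*L bound is unavailable:
        # fall back to comparing all pairs.
        return min(combinations(points, 2), key=lambda pq: dist(*pq))
    best = None
    for (cx, cy), cell in list(grid.items()):
        candidates = list(combinations(cell, 2))
        for dx, dy in NEAR:
            other = grid.get((cx + dx, cy + dy), [])
            candidates += [(a, b) for a in cell for b in other]
        for pq in candidates:
            if best is None or dist(*pq) < dist(*best):
                best = pq
    return best
```

When some square holds two points, their distance is below √2 L, so the closest pair must lie in one square or in a square and one of its 20 neighbours, which is exactly the set of candidates examined.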


2.2.1 Theorem On a CRDW PRAM with n processors, given a set 
of n randomly chosen points, in constant expected time a closest pair 
can be determined. 


3. Final Remarks 


This short preliminary version of the paper sketches some ways to 
accomplish constant time algorithms using random data on a synchro- 
nous CRDW PRAM. To the best of my knowledge, these are the first 
constant expected time algorithms for these problems on any parallel 
model using a linear number of processors. It seems that the CRDW 
PRAM is the "weakest" parallel computer which can solve these prob- 
lems in constant time using only a linear number of processors, but it 
is not clear how to formalize this properly, let alone prove it. 

Many existing algorithms for CRCW PRAMS can be modified to 
work in the same time on CRDW PRAMS. This occurs because the 
only concurrent writes they utilize have the property that whenever two 
PEs are writing to the same memory location at the same time, then 
they are writing the same value. Such algorithms can often be con- 
verted to a scheme as in Lemma 2.0.1, where “conflict” is as useful as 
the value that was being written. However, not all CRCW PRAM 
algorithms are of this form, and more work needs to be done to under- 
stand which problems can be solved faster on a CRCW PRAM than on 
a CRDW PRAM.

The algorithms given herein extend quite naturally to higher-dimensional
data. For any fixed dimension d, given random points chosen
from the d-dimensional unit cube, an n processor CRDW PRAM can 
solve the d-dimensional domination, extreme points, furthest pair, 
smallest enclosing box, smallest enclosing sphere, and closest pair 
problems in constant expected time. 

The algorithms can also be easily extended to many nonuniform 
distributions, such as the d-dimensional normal distribution. How- 
ever, some distributions will cause difficulties because the number of 
maximal or extreme points may be too large to permit the use of 
techniques which assume a high processor/point ratio. For example, 
using the uniform distribution on the unit sphere will result in Θ(n^(1/2))
extreme points, on average [Rayn], rendering the approach of 
Corollary 2.1.4 invalid. 
Abstract


A parallel implementation of a 9-state square-root 
extended Kalman filter for target tracking applica- 
tions on the Warp computer is described. The Warp 
computer is a linear array of ten powerful and pro- 
grammable cells with a maximum performance of 
100 million floating-point operations per second (100 
MFLOPS). The Kalman filter uses numerically reli- 
able square-root filtering algorithms to estimate the 
position, velocity and acceleration of a maneuvering 
target from noisy radar measurements at high data 
rates. The computations include matrix multiplica- 
tion, matrix triangularization, coordinate transfor-
mation, and Jacobian transformation. We describe
the current implementation of the Kalman filter and 
compare its performance on Warp to a variety of ma- 
chines. 


I. Introduction 


Kalman filtering is a general technique for recursively esti- 
mating the state variables of a dynamic system from noisy 
measurements. Applications of the Kalman filter include 
spacecraft orbit determination, target tracking, image pro- 
cessing, economic forecasting and industrial process con- 
trol. 

This paper describes a parallel implementation on the 
Warp computer of a 9-state square-root extended Kalman 
filter for subsonic aerial target tracking applications. The 
filter estimates, from noisy measurements, the state (posi- 
tion, velocity, and acceleration in each of the x, y, and z 
coordinates) of a maneuvering aircraft. 

We chose Warp [1] as the target machine for a num- 
ber of reasons. First, we had access to a working machine. 
Second, linear arrays such as Warp can be highly scalable. 
Third, Warp is specifically designed for applications (such 
as Kalman filters) where the ratio of floating-point com- 
putations to inputs and outputs is large. Fourth, Warp is 
programmed in a high-level language and has a rich pro- 
gramming environment. Finally, Warp will be introduced 
as an Intel product in 1990; the programs we write and the 
lessons we learn today using the current Warp will be di- 
rectly applicable to this new, smaller, more powerful, and 
less expensive machine. 

Sections 2 and 3 give background information on
Kalman filtering and the Warp computer. Section 4 charac-
terizes the computations of the Kalman filter as a directed
graph, describes the mapping of the nodes of this graph
onto the Warp cells, and compares the performance of the
Warp filter to the performance of an identical filter running
on a Vax 11/780, a Sun-3, and a Cray-2.


II. Kalman Filter


A discrete-time equation of target motion can be expressed 
by the following state variable model [2]: 


\[
\begin{aligned}
x(t+1) &= \Phi(\Delta t, \alpha)\, x(t) + w(t) \\
y(t) &= h(x(t)) + v(t)
\end{aligned}
\qquad t = 0, 1, 2, \ldots \tag{1}
\]


where t is the normalized discrete time; $\Delta t$ denotes the
sampling interval; x(t) is a nine-dimensional state vector with
position, velocity, and acceleration in each of the Cartesian
coordinate axes x, y, z. The first-order Markov parameters
of the stochastic acceleration models in the x, y, and z axes are
denoted by the (3 x 3) matrix $\alpha = \mathrm{Diag}[\alpha_x, \alpha_y, \alpha_z]$. The (9 x 9)
state transition matrix $\Phi(\Delta t, \alpha)$ is given by


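\[
\Phi(\Delta t, \alpha) =
\begin{bmatrix}
I & (\Delta t) I & \tfrac{1}{2} (\Delta t)^2 I \\
0 & I & (\Delta t) I \\
0 & 0 & \alpha
\end{bmatrix}
\tag{2}
\]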


where I denotes the (3 x 3) identity matrix. The state 
vector x(t) is represented by 


\[ x(t) = [x, y, z, \dot{x}, \dot{y}, \dot{z}, \ddot{x}, \ddot{y}, \ddot{z}]^T \tag{3} \]


where T is the transpose operation. The (3 x 1) vector 
h(x(t)) is the transformation from the Cartesian coordi- 
nates to the polar coordinates defined by 


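\[
h(x(t)) =
\begin{bmatrix} r \\ \theta_T \\ \theta_E \end{bmatrix}
=
\begin{bmatrix}
\sqrt{x^2 + y^2 + z^2} \\
\tan^{-1}(y/x) \\
\sin^{-1}(z/r)
\end{bmatrix}
\tag{4}
\]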


where r, $\theta_T$, and $\theta_E$ denote the target range, azimuth angle,
and elevation angle, respectively. The measurement vector 


\[ y(t) = [r, \theta_T, \theta_E]^T \tag{5} \]


contains the noisy radar measurements of range, azimuth, 
and elevation angles. 
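
As a rough illustration of Equations (2) and (4), the following Python sketch builds the block-structured transition matrix and applies the Cartesian-to-polar transformation. It is not the W2 code run on Warp. The function names are hypothetical, the Markov parameters passed as `alpha` are placeholder values, and the azimuth and elevation conventions (azimuth measured from the x axis via `atan2`, elevation as `asin(z / r)`) are assumptions.

```python
import math

def transition_matrix(dt, alpha):
    """Build the 9x9 matrix of Equation (2); alpha = (alpha_x, alpha_y, alpha_z)."""
    phi = [[0.0] * 9 for _ in range(9)]
    for axis in range(3):
        pos, vel, acc = axis, axis + 3, axis + 6
        phi[pos][pos] = 1.0            # I
        phi[pos][vel] = dt             # (dt) I
        phi[pos][acc] = 0.5 * dt * dt  # 1/2 (dt)^2 I
        phi[vel][vel] = 1.0            # I
        phi[vel][acc] = dt             # (dt) I
        phi[acc][acc] = alpha[axis]    # alpha
    return phi

def propagate(phi, state):
    """Noise-free state propagation: phi times the 9-element state vector."""
    return [sum(row[k] * state[k] for k in range(9)) for row in phi]

def to_polar(state):
    """Range, azimuth, and elevation of the position part of the state."""
    x, y, z = state[0], state[1], state[2]
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)    # assumed convention: angle from the x axis
    elevation = math.asin(z / r)  # assumed convention: angle above the x-y plane
    return [r, azimuth, elevation]

# Placeholder values chosen only to make the arithmetic easy to follow.
phi = transition_matrix(2.0, (0.5, 0.5, 0.5))
next_state = propagate(phi, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0])
measurement = to_polar([3.0, 4.0, 0.0] + [0.0] * 6)
```

Using `atan2` rather than a plain arctangent of a quotient avoids a division by zero when x is 0; that is a choice of this sketch, not something stated in the paper.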

It is assumed that the state and the measurement noise 
sequences {w(t)} and {v(t)} have the properties: 


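\[
\begin{aligned}
E\{w(t)\} &= 0; \quad E\{v(t)\} = 0 \\
E\{w(t)\, w^T(\tau)\} &= R_1(t)\, \delta_{t\tau} \\
E\{v(t)\, v^T(\tau)\} &= R_2(t)\, \delta_{t\tau} \\
E\{w(t)\, v^T(\tau)\} &= 0
\end{aligned}
\tag{6}
\]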

where $\delta_{t\tau}$ denotes the Kronecker delta function and E
denotes the expectation operator. The elements of the state
noise {w(t)} are assumed uncorrelated in the x, y, and z axes. The
radar measurement errors in the target range, azimuth
angle, and elevation angle are assumed uncorrelated and may
depend on the target range. Let $\sigma_R^2$, $\sigma_{\theta_T}^2$, and $\sigma_{\theta_E}^2$ denote
the variance of the measurement noise in range, azimuth
angle, and elevation angle, respectively. The (3 x 3)
measurement noise covariance matrix $R_2(t)$ is given by


\[ R_2(t) = \mathrm{Diag}[\sigma_R^2, \sigma_{\theta_T}^2, \sigma_{\theta_E}^2] \tag{7} \]

The dynamics and the measurement models presented
above are commonly used in many tracking and navigation
applications. The extended Kalman filter recursively
estimates the (9 x 1) state vector x(t+1) using the following
equation:

\[ \hat{x}(t+1) = \Phi \hat{x}(t) + K(t) \{ y(t+1) - h[\Phi \hat{x}(t)] \} \tag{8} \]


where $\hat{x}(t)$ denotes the state estimate based on the mea-
surements y(t). The (9 x 3) gain matrix K(t) is given by


\[ K(t) = P(t+1 \mid t)\, H_x^T \{ H_x P(t+1 \mid t) H_x^T + R_2(t) \}^{-1} \tag{9} \]


where P(t+1|t) denotes the (9 x 9) covariance matrix of the
filtered error at time t+1 before processing measurements
y(t+1). The extended Kalman filter recursively computes
the (9 x 9) estimation error covariance matrix using the
following equations:


\[
\begin{aligned}
P(t+1 \mid t) &= \Phi P(t \mid t) \Phi^T + R_1(t) && (10) \\
P(t+1 \mid t+1) &= (I - K(t) H_x) P(t+1 \mid t) && (11)
\end{aligned}
\]


The (3 x 9) matrix $H_x$ represents the Jacobian matrix of
h(x) evaluated at $\hat{x}(t)$:

\[ H_x = \left. \frac{\partial h(x)}{\partial x} \right|_{x = \hat{x}(t)} \tag{12} \]

The square-root extended Kalman filter implemented
on Warp is a variant of Equations (8)-(12) that manipulates
a factored form of the covariance matrix P [3]. The
motivation for using a square-root filter is to improve
numerical accuracy by reducing the dynamic range of the
numbers in the covariance matrix. Loosely speaking, a
square-root filter that uses 32-bit floating-point arithmetic
provides the same accuracy as a conventional filter that
uses 64-bit arithmetic. This was important for our
application because the Warp performs only 32-bit floating-point
arithmetic.

Figure 1: Architecture of Warp (diagram showing the Sun-3 master, the input and output clusters, and the Warp array)

Measurement samples for the Warp Kalman filter were 
obtained from a simulated benchmark trajectory [2] where 
a maneuvering target makes a high-g turn past a stationary
radar. The target starts at an initial range of 2000 meters,
and approaches at a velocity of 400 knots at an altitude of
50 meters. Samples of the simulated trajectory were taken
every 20 ms, and noise samples with respective standard
deviations of 15 meters, 2 milliradians, and 3 milliradians 
were added to the sampled range, azimuth angle, and ele- 
vation angle to generate the noisy measurement samples. 


III. The Warp Computer 


The Warp computer [1] (or simply Warp) is a linear 
array of 10 or more identical and programmable cells con- 
nected to a general-purpose host. Warp is designed for 
computationally intensive applications such as image pro- 
cessing [5], and scientific computing [6]. 

Figure 1 shows the architecture of the current 10-cell 
Warp. A Sun-3 workstation called the master is connected 
via a VME repeater to a pair of clusters. Each cluster 
consists of an MC68020 processor(P) with an MC68881 
floating point coprocessor, 3 Mbytes of data memory(M), 1 
Mbyte of program memory, and a switch(S), all connected 
by a local VSB bus. The clusters are known collectively 
as the external host, so called because they are external to 
the Master. 

The Warp array is a linear array of 10 programmable
cells, each containing a pipelined floating-point adder and
multiplier, 8K instruction words, and 32K data words. The 
clock cycle of each cell is 200ns. Each functional unit can 
emit one result per cycle, for a maximum performance of 
5 MFLOPS per functional unit, 10 MFLOPS per cell, and 
100 MFLOPS for the 10-cell Warp array. Each cell is con- 
nected to its nearest neighbor by two 32-bit data channels 
(X and Y). Data flows from left to right along the X chan- 
nel; the Y channel can be statically reconfigured to pass 
data in either direction. Each cell can transfer up to 20 
million 32-bit data words to its neighboring cells each sec- 
ond. 
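
The peak rates quoted above follow directly from the 200 ns clock cycle. The short Python sketch below reproduces that arithmetic; the function name is hypothetical and the figures are the ones given in the text.

```python
CLOCK_CYCLE_NS = 200   # clock cycle of each cell, from the text
UNITS_PER_CELL = 2     # one pipelined adder and one pipelined multiplier
CELLS = 10             # cells in the current Warp array

def peak_mflops(cycle_ns, units, cells):
    """Peak rate in MFLOPS if every functional unit emits one result per cycle."""
    results_per_second_per_unit = 1e9 / cycle_ns
    return results_per_second_per_unit * units * cells / 1e6

per_unit = peak_mflops(CLOCK_CYCLE_NS, 1, 1)
per_cell = peak_mflops(CLOCK_CYCLE_NS, UNITS_PER_CELL, 1)
per_array = peak_mflops(CLOCK_CYCLE_NS, UNITS_PER_CELL, CELLS)
print(per_unit, per_cell, per_array)
```

These are upper bounds that assume both functional units in every cell are busy on every cycle.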

W2 is an Algol-like language with send and receive 
statements for cell/cell and host/array data transfers. The 
W2 compiler inputs a W2 source file, generates microcode 
for the Warp array and IU, and generates C code for the 
clusters. WPE is a programming environment that pro- 
vides a programmable shell for interactively running and 
debugging W2 programs, as well as a set of lower-level rou-
tines that allow user-written C and Lisp programs to access
Warp. 

IV. Implementation

In this section, we characterize the computations per-
formed by the Warp Kalman filter, describe the implemen-
tation of these computations on Warp, and compare the
performance of Warp to other machines.


Kalman Filter Computations 


For our purposes, a computation graph is a directed acyclic
graph consisting of a starting node, s, which has no input
edges, an ending node, e, which has no output edges, and
m computation nodes, $\{v_1, \ldots, v_m\}$, which represent a
partial ordering of computations. Edges are labeled with an
integer number of data items. An edge $(v_i, v_j)$ labeled with
d indicates that node $v_i$ produces d data items which are
consumed by node $v_j$. Edge $(v_i, v_j)$ also represents a weak
precedence relation between nodes $v_i$ and $v_j$ in that node
$v_j$ is guaranteed not to complete its computation until node
$v_i$ has produced all d data items.

The computations required to process one sample can
be characterized at a coarse level by the computation
graph, G, in Figure 2. Node v1 computes the state
propagation $\Phi \hat{x}(t)$ in Equation (8). Node v2 computes the
Jacobian transformation of Equation (12). Node v3
computes a factored form of the right hand side of Equation
(10), as described in [3]. Node v4 computes the Cartesian
to polar coordinate transformation, $h[\Phi \hat{x}(t)]$, in
Equation (8). Node v5 computes a factored form of the
covariance matrix P(t+1|t) in Equation (10) by triangularizing
the matrix produced by node v3 using a modified
weighted Gram-Schmidt orthogonalization technique
described in [3]. Node v6 computes a factored form of the
covariance matrix P(t+1|t+1) in Equation (11) and the
updated state vector $\hat{x}(t+1)$ in Equation (8) using the
scalar measurement update technique described in [3].


Figure 2: Kalman Filter Computation Graph

Figure 3: Warp Kalman Filter Operation Counts


Figure 3 lists the number of floating-point additions, 
floating-point multiplications, inputs, and outputs for each 
node in G. Note that the bulk of the computation occurs 
in nodes v3, v5, and v6 and that these nodes are totally
ordered. 


Mapping on Warp 

The current implementation of the Warp Kalman filter is
700 lines of W2 code. It uses the extremely simple mapping
shown in Figure 4. We start with a topological ordering,
T = <v3, v5, v1, v4, v2, v6>, of the computation nodes in
G. (Recall that a topological ordering of a directed acyclic
graph is a total ordering, $< s_1, s_2, \ldots, s_m >$, such that an
edge from $s_i$ to $s_j$ implies that i < j. Further, such an
ordering is guaranteed to exist [4].) Using T as a guide, we
then assign node v3 to cell 1, node v5 to cell 2, node v1
to cell 3, and so on. The assignment of nodes to cells in
topological order guarantees that data produced by a cell
is not needed by cells to its left.
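
The ordering-then-assignment step can be sketched in Python as follows. The edge list is an assumption inferred from the node descriptions (the actual edges are those of Figure 2, which is not reproduced here), and the function names are hypothetical.

```python
# Assumed edges, inferred from the node descriptions only.
ASSUMED_EDGES = [
    ("v3", "v5"),  # v5 triangularizes the matrix produced by v3
    ("v1", "v4"),  # v4 transforms the state propagated by v1
    ("v5", "v6"),  # v6 updates the covariance factored by v5
    ("v4", "v6"),  # v6 uses the predicted measurement from v4
    ("v2", "v6"),  # v6 uses the Jacobian from v2
]

def is_topological_order(order, edges):
    """True if every edge goes from an earlier node to a later node in order."""
    position = {node: index for index, node in enumerate(order)}
    return all(position[src] < position[dst] for src, dst in edges)

def assign_cells(order):
    """Assign the k-th node of the ordering to cell k (cells numbered from 1)."""
    return {node: cell for cell, node in enumerate(order, start=1)}

T = ["v3", "v5", "v1", "v4", "v2", "v6"]
valid = is_topological_order(T, ASSUMED_EDGES)
cells = assign_cells(T)
```

Because every assumed edge points from a lower-numbered cell to a higher-numbered one, data only ever needs to move left to right, which is the property the mapping relies on.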

Data flows through the Warp array from left to right 
along the X channel. For each sample, the measurement 
vector, state vector, and factored covariance matrix from 
the previous iteration are received by the leftmost cell from 
the input cluster. The updated state vector and covariance 
matrix are sent by the rightmost cell to the output cluster 
for use with the next sample.

Each cell has the same simple behavior. A cell re- 
ceives all of its inputs from its left neighbor. Data that 
is needed for the computation is stored in local memory 
and data that is required by cells to the right is sent to 
its right neighbor. The cell then performs its computation 
and sends the results to its right neighbor. This is repeated
for each sample.

Figure 4: Mapping on Warp

Figure 5: Performance
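
The uniform cell behavior described above (receive from the left, keep what is needed, forward what later cells need, compute, send the result right) can be sketched in Python as below. This is a toy model, not W2 code: the two cells, their operations, and all names are hypothetical.

```python
def run_cell(incoming, needed, forwarded, compute, result_name):
    """Model one cell: consume the stream from the left, emit a stream to the right."""
    local_memory = {}
    outgoing = []
    for name, value in incoming:
        if name in needed:
            local_memory[name] = value      # operand for this cell's computation
        if name in forwarded:
            outgoing.append((name, value))  # required by some cell to the right
    outgoing.append((result_name, compute(local_memory)))
    return outgoing

def run_array(cells, host_input):
    """Push one sample's data through the cells from left to right."""
    stream = list(host_input)
    for cell in cells:
        stream = run_cell(stream, **cell)
    return stream

# Two hypothetical cells: the first adds a and b, the second multiplies by b.
toy_cells = [
    {"needed": {"a", "b"}, "forwarded": {"b"},
     "compute": lambda m: m["a"] + m["b"], "result_name": "s"},
    {"needed": {"s", "b"}, "forwarded": set(),
     "compute": lambda m: m["s"] * m["b"], "result_name": "p"},
]
result = run_array(toy_cells, [("a", 2), ("b", 3)])
```

The sketch runs the cells one after another; on the real array the cells operate concurrently, so it models only the data movement, not the timing.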


The mapping we chose for our first implementation is 
unbalanced, does not scale to linear arrays of arbitrary size, 
and does not “exploit all of the potential parallelism”. The 
simple mapping did, however, enable us to quickly imple- 
ment and debug the filter on Warp, to test it using realis- 
tic data, to alleviate our concerns about software divides 
and square-roots, and to gather some baseline performance 
data. 


Performance 


The key performance measure for the Kalman filter is sam- 
ple time, that is, the real time required to process one mea- 
surement sample. To compare the sample time of the Warp 
with other machines, we used the first 50 samples from the 
benchmark trajectory. We ran a C version of the Kalman 
filter, with compiler optimization enabled, on a Sun-3/75 
with 68881 floating-point coprocessor, a Vax 11/780 with 
floating-point accelerator, and a Cray-2. The sample times 
for Warp and these machines are listed in Figure 5. 


The C programs do not access disk; all data are loaded 
with the programs and are available when the programs be- 
gin executing. The C program for the Cray was compiled 
using Cray’s new vectorizing C compiler. Since the bench- 
mark was not fine-tuned for the Cray-2, we are probably 
not getting the full benefit of the Cray-2 vector units. How- 
ever, with vectorization turned off, the sample time for the 
Cray-2 was identical to the sample time for Warp. 

The sample times in Figure 5 for the Sun-3/75, Vax
11/780, and Cray-2 were obtained using the time command
of the Unix shell. Total user CPU seconds was divided by 
a factor of 50 (the number of samples) to arrive at the 
sample time. We also ran the filters using thousands of 
measurements, with no significant effect on sample time. 

The sample time for Warp was obtained using a built-in
timing facility on the external host. We noted the time
that the input cluster started executing before processing 
the first sample and we noted the time that the output 
cluster finished executing after processing the last sample. 




We divided the difference by 50 to arrive at the elapsed 
time per sample. The Warp sample time includes the time 
required to transfer data between the external host and the 
Warp array, as well as startup times for the external host 
and Warp array.

We are encouraged by the Warp numbers, especially 
in view of the naive mapping. Only six of the ten available 
cells are used and the mapping is unbalanced. Further, 
in order to simplify the implementation, we restarted the 
external host and Warp array from the Sun master for each 
new sample. This (completely avoidable) restart accounted 
for 1/2 of the 12 ms sample time for the Warp. Our
current goal is to reduce the sample time to 1 ms by
developing a more balanced implementation that uses all 
10 cells, by eliminating the restart for each sample, and 
by passing the updated state vector and estimation error 
covariance matrix from the rightmost cell to the leftmost 
cell along the Y channel rather than going back through 
the clusters. 
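
The timing arithmetic in this section can be made explicit with a few lines of Python. The 12 ms sample time, the one-half restart share, the 50 samples, and the 1 ms goal come from the text; the 600 ms total is only the value implied by 12 ms times 50 samples, and the function name is hypothetical.

```python
def sample_time_ms(total_ms, samples):
    """Elapsed (or CPU) time per sample: total time divided by the sample count."""
    return total_ms / samples

SAMPLES = 50
warp_sample_ms = sample_time_ms(600.0, SAMPLES)  # 600 ms is implied, not measured
restart_share = 0.5                              # restart is 1/2 of the sample time
without_restart_ms = warp_sample_ms * (1 - restart_share)
goal_ms = 1.0
remaining_factor = without_restart_ms / goal_ms
print(warp_sample_ms, without_restart_ms, remaining_factor)
```

Under these figures, eliminating the restart alone would leave roughly a sixfold gap to the 1 ms goal, which the balanced 10-cell mapping and the Y-channel feedback would have to close.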


V. Conclusions 


We have described the implementation of a 9-state ex- 
tended Kalman filter for target tracking applications on 
the Warp computer. It was shown that Warp, using a 
very simple mapping of the Kalman filter, achieved speeds 
within a factor of four of a Cray-2. The work has expanded 
the application domain of the Warp computer to include 
Kalman filtering, a general technique with many applica- 
tions. 
Abstract -- Although many solution methods are available for
the linear programming problem, the simplex method is 
undoubtedly the most widely used one for its simplicity. In 
this paper, we shall propose an implementation of the simplex 
method on fixed-size hypercubes. A partitioning technique and 
a mapping technique are also presented to fit large-size problem 
instances into relatively small-size hypercubes. Two cases, 
pipelined broadcastings allowed and pipelined broadcastings 
not allowed, are considered. We have shown that the proposed 
implementation achieves the optimal speedup asymptotically for
both cases. In addition, we have derived sufficient
conditions for optimal partitionings when the problem instance 
sizes are considered finite. These sufficient conditions will be 
useful to obtain better partitionings. Further, optimal 
partitionings are found for some special cases. 


1. Introduction 


Linear programming is a fundamental problem in 
operations research, and has received much attention for its 
importance. Mathematically, this problem can be formulated in 
standard form [11] as follows. 


\[
\begin{aligned}
\text{minimize} \quad & z = cx \\
\text{subject to} \quad & Ax = d \\
& x \ge 0
\end{aligned}
\tag{1}
\]


where $A = [a_{ij}]$ is an MxN constraint matrix, $0 \le i \le M-1$,
$0 \le j \le N-1$, $d = [d_0, d_1, \ldots, d_{M-1}]^T$ is a positive column vector of
length M, $c = [c_0, c_1, \ldots, c_{N-1}]$ is a row vector of length N, and
$x = [x_0, x_1, \ldots, x_{N-1}]^T$ is a column vector of length N. The linear
programming problem is to find the minimum of z. Dantzig [5]
has proposed a well-known solution, the simplex method, for 
this problem. The simplex method starts from an initial 
feasible solution, and then moves continuously from one 
feasible solution to another, if improvement is obtained. Since 
the feasible solution space is a convex set, the optimum will be 
reached eventually after a finite number of iterations. The 
sequential time for each iteration is O(MN). 

The simplex method has been implemented on several
parallel machines. For example, a VLSI wavefront array 
processor implementation was proposed by Onaga and 
Nagayasu[10], and a VLSI mesh of trees implementation was 
proposed by Bertossi and Bonuccelli[2]. Onaga and 
Nagayasu's approach uses MxN processors and requires 
O(N+M) time for each iteration. Bertossi and Bonuccelli's 
approach uses MxN mesh of trees with O(MNlogMlog>n) area 
and requires O(logN) time for each iteration. Both approaches 
need extra hardware and more complicated control when not
enough processors are available. In Onaga and Nagayasu's
approach, some fixed-size chips are connected together to fit 
problem instance sizes. The extra hardware connections incur 
a lot of cost, and the intra-chip controls and communications
are not trivial. Moreover, they proposed an alternative that the 
constraint matrix is folded into a linear array. So, each 
processor holds a submatrix. However, this still complicates 
the necessary controls and communications. Bertossi and 
Bonuccelli proposed another partitioning approach in which 
the constraint matrix is partitioned into submatrices, each with 
the same size as the mesh of trees. Their approach needs to 
perform several successive inputs and outputs during each 
iteration, which may slow down the execution. Finally, neither
approach achieves the optimal speedup.

The hypercube is a high-connectivity and regular parallel
architecture. These two properties favor communications
among processors. In practice, any two processors can
communicate with each other within $\log_2 p$ steps, where p is the
number of processors in the hypercube. Further, the ability of
fault tolerance enhances its reliability. To fully utilize these 
advantages, several practical hypercube machines have been 
built, such as the Caltech's Cosmic Cube[13], the NCUBE 
computer [6], the Intel iPSC[8], the Connection Machine[7], 
and the Butterfly Machine[3]. In this paper, we shall propose 
an implementation of the simplex method on fixed-size 
hypercubes. We consider two cases: pipelined broadcastings 
allowed and pipelined broadcastings not allowed. We have 
shown that both cases can achieve the optimal speedup 
asymptotically. In addition, we focus our attention on 
minimizing communication overheads when the problem 
instance sizes are considered finite. Sufficient conditions for 
optimal partitionings (here, optimal partitionings mean those 
that optimize communication given the mapping method) are 
derived which will be useful to obtain better partitionings. 
Also, optimal partitionings are found for some special cases. 

The remainder of this paper is organized as follows. The 
next section briefly reviews the simplex method. Section 3 
describes the hypercube and the embedded hypercube. In
section 4, we present an implementation of the simplex method 
on fixed-size hypercubes. A partitioning technique and a 
mapping technique are presented to fit large-size problem 
instances into relatively small-size hypercubes. Also, we have 
a discussion on optimal partitionings. Finally, concluding 
remarks are given in section 5. 


2. The Simplex Method


In this section we shall briefly review the simplex method. 
For a more detailed description, the interested reader may consult
[14]. Given a linear programming problem in standard form of 
(1), the simplex method starts from an initial feasible solution 
and moves toward the optimal solution. The execution is 
performed by an iterative procedure. In each iteration, a basic 
solution will be obtained by setting N-M variables equal to zero. 
These variables are called nonbasic variables; the remaining 
ones are called basic variables. In the simplex
method, the sets of basic variables in adjacent iterations differ
by only one variable. Therefore, the basic solution for the next
iteration can be obtained from the current basic solution by 
exchanging one nonbasic variable for one basic variable. The 
nonbasic variable that "enters" the basic solution is called the 
entering variable, and the basic variable that "leaves" the basic 
solution is called the leaving variable. These two variables can 
be determined according to the optimality condition and the 
feasibility condition as stated below. 

Optimality condition. The entering variable is the nonbasic
variable with the most negative coefficient $c_j$. A tie is broken
arbitrarily. When all the nonbasic variables have nonnegative
$c_j$'s, the current value of z is optimal.

Feasibility condition. Assume that $x_u$ is selected as the entering
variable. Then, the leaving variable is the basic variable $x_v$
satisfying $a_{tv} = 1$ and $d_t / a_{tu} = \min\{ d_i / a_{iu} \mid 0 \le i \le M-1 \text{ and } a_{iu} > 0 \}$.
Further, $a_{tu}$ is called the pivot element, and the (t+1)-th row
(the (u+1)-th column) of A is called the pivot row (the pivot
column).

After determining the entering variable and the leaving 
variable, the constraint matrix A and the vectors c, d, and z are
updated as follows. 


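\[
\begin{aligned}
a'_{tj} &\leftarrow a_{tj} / a_{tu} && (2) \\
a'_{ij} &\leftarrow a_{ij} - a_{iu}\, a'_{tj}, \quad i \ne t && (3) \\
d'_t &\leftarrow d_t / a_{tu} && (4) \\
d'_i &\leftarrow d_i - a_{iu}\, d'_t, \quad i \ne t && (5) \\
c'_j &\leftarrow c_j - c_u\, a'_{tj} && (6) \\
z' &\leftarrow z + c_u\, d'_t && (7)
\end{aligned}
\]

where j = 0, 1, ..., N-1, and $a_{ij}$, $d_i$, $c_j$, and z ($a'_{ij}$, $d'_i$, $c'_j$, and z')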
denote the old (new) values. The use of both the optimality 
condition and the feasibility condition will result in a better 
feasible basic solution in each iteration. Since all the feasible 
basic solutions correspond to the vertices of a convex polytope 
in N dimensions, the optimum will be reached eventually after a 
finite number of iterations. Let us assume that performing each 
binary operation requires the same time and therefore counts 
one computation step. Then, the total number of (sequential) 
computation steps required for each iteration is 
2MN+2N+3M-1. 
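
A sequential version of one iteration, following the optimality condition, the feasibility condition, and updates (2)-(7), might look like the Python sketch below. It assumes the problem is already in canonical form (basic variables have zero cost coefficients and unit columns), uses exact `Fraction` arithmetic to avoid rounding, and breaks ties by lowest index. The function name and the small example problem are hypothetical.

```python
from fractions import Fraction

def simplex_iteration(A, d, c, z):
    """Perform one simplex iteration; returns (A, d, c, z, done)."""
    M, N = len(A), len(c)
    # Optimality condition: pick the most negative cost coefficient.
    u = min(range(N), key=lambda j: c[j])
    if c[u] >= 0:
        return A, d, c, z, True
    # Feasibility condition: minimum ratio over rows with a positive entry.
    candidates = [i for i in range(M) if A[i][u] > 0]
    if not candidates:
        raise ValueError("problem is unbounded")
    t = min(candidates, key=lambda i: d[i] / A[i][u])
    pivot = A[t][u]
    new_row = [A[t][j] / pivot for j in range(N)]                     # (2)
    new_dt = d[t] / pivot                                             # (4)
    new_A = [new_row if i == t else
             [A[i][j] - A[i][u] * new_row[j] for j in range(N)]       # (3)
             for i in range(M)]
    new_d = [new_dt if i == t else d[i] - A[i][u] * new_dt
             for i in range(M)]                                       # (5)
    new_c = [c[j] - c[u] * new_row[j] for j in range(N)]              # (6)
    new_z = z + c[u] * new_dt                                         # (7)
    return new_A, new_d, new_c, new_z, False

# Hypothetical example: minimize -3*x0 - 2*x1 with two slack variables.
A = [[Fraction(v) for v in row] for row in ([1, 1, 1, 0], [1, 3, 0, 1])]
d = [Fraction(4), Fraction(6)]
c = [Fraction(-3), Fraction(-2), Fraction(0), Fraction(0)]
z = Fraction(0)
done = False
while not done:
    A, d, c, z, done = simplex_iteration(A, d, c, z)
print(z, d)
```

The doubly nested update (3) dominates the cost, which is where the 2MN term in the sequential step count comes from.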


3. The Hypercube and the Embedded Hypercube


In general, there are two classifications of parallel 
processing models [9]. The shared memory model 
characterizes the tightly coupled processing, and the distributed 
memory model characterizes the loosely coupled processing. 
In the latter, it is assumed that each processor has its own local 
memory and communicates with others through the 
interconnection mechanisms. The hypercube belongs to the 
distributed memory model. 

An h-dimensional hypercube contains $p = 2^h$ (h is a positive
integer) identical processors. Each processor is given an h-bit
address $(b_h, b_{h-1}, \ldots, b_1)$, $b_i \in \{0, 1\}$, $1 \le i \le h$, and there exists a link
between two processors if and only if their addresses differ in
exactly one bit position. The hypercube has a recursive
structure as explained below. An h-dimensional hypercube can
be regarded as composed of two (h-1)-dimensional
hypercubes, one with $b_h = 0$ and the other with $b_h = 1$.


Similarly, each of the two (h-1)-dimensional hypercubes can be
further regarded as composed of two (h-2)-dimensional
hypercubes, one with $b_{h-1} = 0$ and the other with $b_{h-1} = 1$, and so
on. Generally, each of the sets of processors whose addresses
differ in k ($1 \le k \le h$) specified bit positions $b_{i_1}, b_{i_2}, \ldots, b_{i_k}$ and
are the same in the remaining (h-k) bit positions forms a
k-dimensional hypercube. Such a hypercube is called a
k-dimensional embedded hypercube on $(b_{i_1}, b_{i_2}, \ldots, b_{i_k})$.


One important issue in implementing the simplex method on
the hypercube is data broadcasting on some embedded
hypercubes. That is, some designated processors must
transmit data to all the other processors belonging
to the same embedded hypercubes. These data include the
pivot element, the pivot row, the pivot column, etc. To
broadcast on a k-dimensional embedded hypercube (on
$(b_{i_1}, b_{i_2}, \ldots, b_{i_k})$), k communication steps are necessary.
Initially, the data to be broadcast are located in the designated
processor. In the r-th step ($1 \le r \le k$), each of the processors
owning the data sends a duplicate to the processor whose
address differs from its address in the bit position $b_{i_r}$. Thus, k
communication steps are sufficient to complete the
broadcasting. By taking advantage of the property of fast
broadcasting, we can implement the simplex method efficiently
on the hypercube.
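
The broadcasting procedure can be simulated in a few lines of Python. In this sketch processor addresses are plain integers, bit positions are numbered from 1 at the least significant bit as in the text, and the function name is hypothetical.

```python
def broadcast(root, bit_positions):
    """Simulate a broadcast on the embedded hypercube spanned by bit_positions.

    Returns the set of processors holding the data and the number of
    communication steps used.
    """
    holders = {root}
    steps = 0
    for r in bit_positions:
        mask = 1 << (r - 1)
        # Every current holder sends a copy across the link for bit b_r.
        holders |= {address ^ mask for address in holders}
        steps += 1
    return holders, steps

holders, steps = broadcast(0b0000, [1, 3])
print(sorted(holders), steps)
```

Each step doubles the number of holders, so after k steps all 2^k processors of the embedded hypercube have the data.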

From the above discussion, two properties of the 
hypercubes immediately follow. 
Property 1. Broadcasting on a k-dimensional embedded
hypercube requires k communication steps. 
An operation is called a semigroup operation if it is associative. 
Some well-known semigroup operations are addition, 
multiplication, finding maximum, and finding minimum. The 
following property can be obtained from Property 1. 
Property 2. Performing semigroup operations on a 
k-dimensional embedded hypercube (one operand in each 
processor) requires k communication steps and k computation 
steps. Additional k communication steps are necessary if the 
computation result is required by every processor. 
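
Property 2 can be illustrated with the following Python sketch, which combines one operand per processor along each dimension in turn. It assumes the result is collected at the processor whose specified bits are all zero; the paper does not say which processor ends up with it, and the function name is hypothetical.

```python
def reduce_on_hypercube(values, bit_positions, op):
    """Combine one operand per processor with an associative operation.

    values maps each processor address of the embedded hypercube to its
    operand. Returns (address holding the result, result,
    communication steps, computation steps).
    """
    current = dict(values)
    comm_steps = 0
    comp_steps = 0
    for r in bit_positions:
        mask = 1 << (r - 1)
        survivors = {}
        for address, value in current.items():
            if address & mask == 0:
                # The partner with bit b_r set sends its partial result here.
                survivors[address] = op(value, current[address | mask])
        current = survivors
        comm_steps += 1
        comp_steps += 1
    (address, result), = current.items()
    return address, result, comm_steps, comp_steps

operands = dict(enumerate([7, 3, 9, 1, 8, 2, 6, 4]))
outcome = reduce_on_hypercube(operands, [1, 2, 3], min)
print(outcome)
```

If every processor needs the result, a broadcast back over the same k dimensions accounts for the additional k communication steps mentioned in Property 2.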


4. The Implementation of the Simplex Method
on Fixed-Size Hypercubes


In this section, we shall propose an approach to implement 
the simplex method on fixed-size hypercubes. Suppose that 
the hypercube is h-dimensional and therefore contains $p = 2^h$
processors, where $MN \ge p$ is assumed. The case of MN<p is
trivial. Further, we assume that each processor can 
simultaneously communicate (send data or receive data) with 
its neighbor processors within a communication step. 
Therefore, if a processor wants to broadcast w data on a 
k-dimensional embedded hypercube, it can send out these data 
in consecutive w communication steps. After k+w-1 
communication steps, the broadcasting can be completed[4]. 
Broadcasting in such a way is called pipelined broadcasting. 
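
The k+w-1 bound can be checked with a small Python simulation. It assumes, as stated above, that a processor may use several of its links in the same communication step, and it schedules item q to cross dimension r at step q+r; this schedule and the function name are choices of the sketch, not details taken from [4].

```python
def pipelined_broadcast_steps(root, bit_positions, w):
    """Count communication steps needed to broadcast w data items in a pipeline."""
    k = len(bit_positions)
    holders = [{root} for _ in range(w)]  # holders[q]: processors that have item q
    step = 0
    while any(len(h) < 2 ** k for h in holders):
        step += 1
        for q in range(w):
            stage = step - q  # item q enters the pipeline at step q + 1
            if 1 <= stage <= k:
                mask = 1 << (bit_positions[stage - 1] - 1)
                holders[q] |= {address ^ mask for address in holders[q]}
    return step

steps_needed = pipelined_broadcast_steps(0, [1, 2, 3], 4)
print(steps_needed)
```

Without pipelining, the same transfer would take k communication steps per item, that is kw steps in total.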

Before presenting our approach, let us discuss the 
partitionings of data and then the mappings of data onto the 
hypercube. First, we consider the simple case of MN=p. In 
this case, the data partitionings are not necessary since the 
sizes of A,c, and d are not larger than the size of the 
hypercube. Accordingly, A, c, and d are mapped directly onto 
the hypercube as follows (let $g = \log_2 N$).


1. $a_{ij}$ is placed into the processor with address
$(b_h b_{h-1} \ldots b_{g+1} b_g \ldots b_1)$, where $b_h b_{h-1} \ldots b_{g+1}$ is the binary
representation of i and $b_g \ldots b_1$ is the binary representation of j.

2. $c_j$ is placed into the processor with address
$(0, 0, \ldots, 0, b_g, \ldots, b_1)$, where $b_g \ldots b_1$ is the binary representation
of j.

3. $d_i$ is placed into the processor with address
$(b_h, b_{h-1}, \ldots, b_{g+1}, 1, 1, \ldots, 1)$, where $b_h b_{h-1} \ldots b_{g+1}$ is the binary
representation of i.


In the case of MN>p, the size of A is larger than the size of
the hypercube. Therefore, A must be partitioned into p
equal-size submatrices before it can be mapped onto the
hypercube. Since A is an MxN matrix, each of the submatrices
is of size $k_1 \times k_2$, where $k_1 k_2 = MN/p$. Without loss of generality,
let us assume that $M = k_1 2^m$ and $N = k_2 2^n$ (m+n=h) (if not so, we
can extend A appropriately by adding dummy rows and
dummy columns to A) in the following discussion. The
partitioning of A is shown in Figure 1(a), where the $A_{ij}$'s
($0 \le i \le 2^m - 1$ and $0 \le j \le 2^n - 1$) represent the submatrices. As for c
and d, they must be partitioned as well. The partitionings of c
and d are shown in Figure 1(b) and Figure 1(c) respectively,
where the $c_j$'s ($0 \le j \le 2^n - 1$) and $d_i$'s ($0 \le i \le 2^m - 1$) represent row
subvectors of length $k_2$ and column subvectors of length $k_1$
respectively. The mappings of A, c, and d onto the hypercube
are as follows.


1. $A_{ij}$ is placed into the processor with address
$(b_h b_{h-1} \ldots b_{n+1} b_n \ldots b_1)$, where $b_h b_{h-1} \ldots b_{n+1}$ is the binary
representation of i and $b_n \ldots b_1$ is the binary representation of j.

2. $c_j$ is placed into the processor with address
$(0, 0, \ldots, 0, b_n, \ldots, b_1)$, where $b_n \ldots b_1$ is the binary representation
of j.

3. $d_i$ is placed into the processor with address
$(b_h, b_{h-1}, \ldots, b_{n+1}, 1, 1, \ldots, 1)$, where $b_h b_{h-1} \ldots b_{n+1}$ is the binary
representation of i.

In both cases, z is placed into the processor with address
(0,0,...,0,1,...,1) (that is, $b_h = b_{h-1} = \ldots = b_{n+1} = 0$ and
$b_n = \ldots = b_1 = 1$).
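
The address mappings for the partitioned case amount to simple bit manipulation, as the Python sketch below illustrates. Addresses are treated as integers whose high m bits encode i and whose low n bits encode j; the function names are hypothetical. The last two lines check the containment properties stated as Facts 3 and 4.

```python
def address_of_A(i, j, n):
    """Processor holding submatrix A_ij: bits of i followed by the n bits of j."""
    return (i << n) | j

def address_of_c(j, n):
    """Processor holding subvector c_j: high bits all zero, low n bits encode j."""
    return j

def address_of_d(i, n):
    """Processor holding subvector d_i: bits of i followed by n one bits."""
    return (i << n) | ((1 << n) - 1)

def address_of_z(n):
    """Processor holding z: high bits all zero, low n bits all one."""
    return (1 << n) - 1

def row_hypercube(i, n):
    """Addresses of the row-i embedded hypercube."""
    return {address_of_A(i, j, n) for j in range(1 << n)}

def column_hypercube(j, m, n):
    """Addresses of the column-j embedded hypercube."""
    return {address_of_A(i, j, n) for i in range(1 << m)}

m, n = 2, 3  # hypothetical 5-dimensional hypercube with 32 processors
c_in_row_0 = all(address_of_c(j, n) in row_hypercube(0, n) for j in range(1 << n))
d_in_last_column = all(address_of_d(i, n) in column_hypercube((1 << n) - 1, m, n)
                       for i in range(1 << m))
```

With this layout, a row-wise operation only touches the low n address bits and a column-wise operation only touches the high m bits, so the two kinds of broadcast use disjoint sets of links.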

According to the above mappings, four facts follow.
Fact 1. For any fixed i, the submatrices $A_{ij}$, $0 \le j \le 2^n - 1$, are
mapped onto an n-dimensional embedded hypercube on
$(b_n, \ldots, b_1)$. Moreover, each of the processors in the embedded
hypercube has the most significant m bits equal to the binary
representation of i. For easy description, we shall refer to this
embedded hypercube as row-i embedded hypercube.

Fact 2. For any fixed j, the submatrices $A_{ij}$, $0 \le i \le 2^m - 1$,
are mapped onto an m-dimensional embedded hypercube on
$(b_h, b_{h-1}, \ldots, b_{n+1})$. Moreover, each of the processors in the
embedded hypercube has the least significant n bits equal to the
binary representation of j. In the following discussion, we
shall refer to this embedded hypercube as column-j embedded
hypercube.

Fact 3. The row vector c is mapped onto the row-0
embedded hypercube.

Fact 4. The column vector d is mapped onto the
column-($2^n - 1$) embedded hypercube.

We now describe the implementation of the simplex method on
the hypercube. Since the simplex method consists of a finite
number of iterations, we shall concentrate on the operations
required in each iteration. These operations include determining
the pivot row, the pivot column, and the pivot element, and
updating A, d, c, and z as stated by (2), (3), (4), and (5)
respectively. In the following, we describe the implementation
of these operations.


Determine the Pivot Column

The operand required for this operation is c. According
to Fact 3, we know that $k_2$ elements of c are distributed in each
processor of the row-0 embedded hypercube. Thus, the most
negative element in each processor is determined first. This
takes $k_2-1$ computation steps. Then, according to Property 2,
the most negative element of c can be determined taking 2n
communication steps and n computation steps. The pivot
column is the (u+1)-th column of A, if $c_u$, $0 \le u \le N-1$, is the
most negative element. If there are no negative elements, the
current value of z is optimal.
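
Property 2 is not reproduced in this part of the text, so the following is only one plausible way to arrive at 2n communication steps and n computation steps: reduce the local minima to one processor across the n dimensions, then broadcast the result back. The simulation below assumes that scheme; `hypercube_min` is a hypothetical name.

```python
def hypercube_min(values):
    """Simulate a minimum over an n-cube; values[a] is held by processor a.

    Assumed scheme (not taken from the paper): n reduction rounds toward
    processor 0, followed by n broadcast rounds back to every processor.
    Returns (values held after the broadcast, communication steps, computation steps).
    """
    p = len(values)
    n = p.bit_length() - 1
    assert p == 1 << n, "number of processors must be a power of two"
    held = list(values)
    comm_steps = comp_steps = 0

    # Reduction: in round d, processors whose low d+1 bits are zero receive
    # from their neighbor across dimension d and keep the smaller value.
    for d in range(n):
        for a in range(p):
            if a & ((1 << (d + 1)) - 1) == 0:
                held[a] = min(held[a], held[a | (1 << d)])
        comm_steps += 1
        comp_steps += 1

    # Broadcast: the same pairs are used in reverse order, now sending outward.
    for d in reversed(range(n)):
        for a in range(p):
            if a & ((1 << (d + 1)) - 1) == 0:
                held[a | (1 << d)] = held[a]
        comm_steps += 1

    return held, comm_steps, comp_steps
```

With eight local minima (n = 3) the sketch uses 6 communication steps and 3 comparison steps, which agrees with the counts quoted above.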


Determine the Pivot Row and the Pivot Element

The operands required for this operation are d and the
pivot column. First, each subvector $d_i$, $0 \le i \le 2^m-1$, is sent
to the processor that holds $A_{i,y}$, where $y = u$ DIV $k_2$ and u
is the index of the pivot column. Since $d_i$ and $A_{i,y}$ are in the
same embedded hypercube (the row-i embedded hypercube),
the transmissions can be pipelined and performed in parallel.
Therefore, $k_1+n-1$ communication steps are required. Then,
the minimum of the $d_i/a_{i,u}$'s with $a_{i,u}>0$ is determined in each
processor. This takes at most $3k_1-1$ computation steps.
Finally, the minimum of these $2^m$ minima is determined. Since
these minima are in the column-y embedded hypercube, 2m
communication steps and m computation steps are required.
The pivot row is the (t+1)-th row of A, if $d_t/a_{t,u}$ is the
minimum. Besides, $a_{t,u}$ is the pivot element.
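
For reference, the two selection rules can be written sequentially as follows. This is a sketch of the rules as stated in the text (most negative $c_u$, then the smallest ratio over rows with a positive pivot-column entry), not of the parallel implementation; the handling of a column with no positive entry is an added assumption.

```python
from fractions import Fraction


def choose_pivot(c, d, A):
    """Return (t, u) for the pivot element A[t][u], or None if z is optimal."""
    # Pivot column: index of the most negative element of c.
    u = min(range(len(c)), key=lambda j: c[j])
    if c[u] >= 0:
        return None  # no negative element: current z is optimal

    # Pivot row: minimum of d[i] / A[i][u] over rows with A[i][u] > 0.
    ratios = [(Fraction(d[i]) / A[i][u], i) for i in range(len(d)) if A[i][u] > 0]
    if not ratios:
        # Not discussed in the text; assumed here to signal an unbounded problem.
        raise ValueError("pivot column has no positive entry")
    _, t = min(ratios)
    return t, u
```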


Update A, d, c, and z

(a) The pivot element is broadcast on the row-(t DIV $k_1$)
embedded hypercube, and then the pivot row is updated as
stated by (2). This takes n communication steps and one
computation step.

(b) The elements of the pivot column belonging to $A_{i,y}$, where
$0 \le i \le 2^m-1$ and $y = u$ DIV $k_2$, are broadcast on the row-i
embedded hypercube. Since these broadcastings can be
pipelined and performed in parallel, $k_1+n-1$ communication
steps are required.

(c) $c_u$ is broadcast on the row-0 embedded hypercube. This
takes n communication steps.

(d) The elements of the pivot row belonging to $A_{x,j}$, where
$0 \le j \le 2^n-1$ and $x = t$ DIV $k_1$, are broadcast on the column-j
embedded hypercube. This takes $k_2+m-1$ communication
steps.

(e) $d_t$ is updated as stated by (4). This takes one computation
step.

(f) $d_t$ is broadcast on the column-$(2^n-1)$ embedded hypercube.
This takes m communication steps.

(g) A, c, d, and z are updated as stated by (3), (5), (6), and
(7) respectively. This takes $2(k_1k_2+k_1+k_2+1)$ computation
steps in total.


Thus, the total numbers of (parallel) computation steps and
(parallel) communication steps required for each iteration are

$2k_1k_2 + 5k_1 + 3k_2 + h + 2$    (8)
and
$2k_1 + k_2 + 2n + 4h - 3$    (9)

respectively. Since $k_1k_2$ is equal to MN/p, the optimal speedup
is achieved asymptotically. On the other hand, if MN is
considered finite, then we choose $k_1$ and $k_2$ to minimize (9)
(this is necessary because communication steps are much more
costly than computation steps). By substituting
$MN/(pk_1)$ for $k_2$ and $\log_2(pk_1/M)$ for n, (9) becomes

$2k_1 + MN/(pk_1) + 2\log_2(pk_1/M) + 4h - 3$.    (10)


Then, differentiating (10) with respect to $k_1$ and setting the
derivative to 0, we have

$2 - MN/(pk_1^2) + (2/k_1)\log_2 e = 0$,    (11)

where e denotes the base of the natural logarithm. It is
impossible to solve (11) for integer $k_1$. However, we think that
(11) is useful for obtaining a better $k_1$.

In the above discussion, we assume that pipelined
broadcastings are allowed. In case pipelined broadcastings are
not allowed, broadcasting $k_1$ ($k_2$) data on an n(m)-dimensional
embedded hypercube requires $k_1n$ ($k_2m$) communication
steps. In this case, the total number of communication steps
required for each iteration is equal to

$(2k_1+1)n + k_2m + 3h$.    (12)
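
Because m and n are integers with m+n = h, the feasible partitionings of a given instance can simply be enumerated. The sketch below evaluates the step counts (8), (9) and (12) and picks the cheapest split; it assumes M and N are already multiples of the required powers of two (no dummy rows or columns are added), and the function names are illustrative.

```python
def computation_steps(k1, k2, m, n):
    """Expression (8), with h = m + n."""
    return 2 * k1 * k2 + 5 * k1 + 3 * k2 + (m + n) + 2


def comm_steps_pipelined(k1, k2, m, n):
    """Expression (9), with h = m + n."""
    return 2 * k1 + k2 + 2 * n + 4 * (m + n) - 3


def comm_steps_unpipelined(k1, k2, m, n):
    """Expression (12), with h = m + n."""
    return (2 * k1 + 1) * n + k2 * m + 3 * (m + n)


def best_partition(M, N, h, cost):
    """Try every split h = m + n and return (cost, m, n, k1, k2) of the cheapest."""
    best = None
    for m in range(h + 1):
        n = h - m
        if M % (1 << m) or N % (1 << n):
            continue  # this split would need dummy rows or columns
        k1, k2 = M >> m, N >> n
        candidate = (cost(k1, k2, m, n), m, n, k1, k2)
        if best is None or candidate < best:
            best = candidate
    return best
```

For M = 64, N = 128 and p = 64 (h = 6), both communication models select m = n = 3 in this enumeration, that is $k_1 = 8$ and $k_2 = 16$.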


It is clear that the optimal speedup can still be achieved
asymptotically when pipelined broadcastings are not allowed.
On the other hand, if MN is considered finite, then $k_1$ and $k_2$
must be chosen carefully to minimize (12). By substituting
$MN/(pk_1)$ for $k_2$, $\log_2(pk_1/M)$ for n, and $\log_2(M/k_1)$ for m,
(12) becomes

$(2k_1+1)\log_2(pk_1/M) + (MN/(pk_1))\log_2(M/k_1) + 3h$.    (13)

Then, differentiating (13) with respect to $k_1$ and setting the
derivative to 0, we have

$2\log_2(pk_1/M) - (MN/(pk_1^2))\log_2(M/k_1) + (2 + (1/k_1) - (MN/(pk_1^2)))\log_2 e = 0$.    (14)

It is very difficult (even impossible) to solve (14) for integer
$k_1$. However, the constant 1 in (13) is negligible when
compared with $2k_1$, and (13) can therefore be simplified to

$2k_1\log_2(pk_1/M) + (MN/(pk_1))\log_2(M/k_1) + 3h$.    (15)

In the following, we show that $2k_1 = k_2$ will minimize (15)
when 2M = N.


Lemma 1. If the number of communication steps required can
be expressed in the form of

$k_1n + k_2m + c$,    (16)

where c is a constant, then $k_1 = k_2 = (MN/p)^{1/2}$ minimizes (16)
when M = N.

Proof. Substituting $MN/(pk_1)$ for $k_2$, $\log_2(pk_1/M)$ for n, and
$\log_2(M/k_1)$ for m, (16) becomes

$k_1\log_2(pk_1/M) + (MN/(pk_1))\log_2(M/k_1) + c$.    (17)

By differentiating (17) with respect to $k_1$ and setting the
derivative to 0, we have

$\log_2(pk_1/M) - (MN/(pk_1^2))\log_2(M/k_1) + (1 - (MN/(pk_1^2)))\log_2 e = 0$.    (18)

It is not difficult to check that $k_1 = k_2 = (MN/p)^{1/2}$ is a solution of
(18) when M = N. Thus, this lemma follows. Q.E.D.
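
The check mentioned in the proof can be written out explicitly. The following worked substitution uses only the quantities already defined for (17) and (18):

```latex
\begin{align*}
M = N,\quad k_1 = \sqrt{MN/p}
  \;&\Longrightarrow\; k_1 = \frac{M}{\sqrt{p}}, \qquad
  \frac{p k_1}{M} = \sqrt{p}, \qquad
  \frac{M}{k_1} = \sqrt{p}, \qquad
  \frac{MN}{p k_1^{2}} = 1, \\[4pt]
\log_2\frac{p k_1}{M} - \frac{MN}{p k_1^{2}}\log_2\frac{M}{k_1}
  + \Bigl(1 - \frac{MN}{p k_1^{2}}\Bigr)\log_2 e
  \;&=\; \log_2\sqrt{p} - \log_2\sqrt{p} + 0 \cdot \log_2 e \;=\; 0 .
\end{align*}
```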


Lemma 2. If the number of communication steps required can
be expressed in the form of

$ak_1n + bk_2m + c$,    (19)

where a, b and c are constants, then $ak_1 = bk_2$ minimizes (19)
when aM = bN.

Proof. This lemma is a generalization of Lemma 1.
Substituting $MN/(pk_1)$ for $k_2$, $\log_2(pk_1/M)$ for n, and
$\log_2(M/k_1)$ for m, (19) becomes

$ak_1\log_2(pk_1/M) + b(MN/(pk_1))\log_2(M/k_1) + c$.    (20)

Let $k_1' = ak_1$, $M' = aM$, and $N' = bN$. Then, $k_2' = M'N'/(pk_1') = bk_2$.
By introducing $k_1'$, $M'$ and $N'$ into (20), we have

$k_1'\log_2(pk_1'/M') + (M'N'/(pk_1'))\log_2(M'/k_1') + c$,    (21)

which is in the same form as (17). Thus, according to Lemma
1, we know that $k_1' = k_2'$ minimizes (21) when $M' = N'$. Thus
the lemma follows. Q.E.D.

Since (20) is a general form of (15), it is proved that $2k_1 = k_2$
minimizes (15) when 2M = N.
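
The conclusion can also be checked numerically by treating $k_1$ as a real variable and scanning (15) on a fine grid. This is only a sanity check on one instance, not a proof; the grid spacing and the function names are arbitrary choices made for the sketch.

```python
from math import log2


def cost_15(k1, M, N, p):
    """Expression (15) with k2 = MN/(p*k1), n = log2(p*k1/M), m = log2(M/k1)."""
    h = log2(p)
    return 2 * k1 * log2(p * k1 / M) + (M * N / (p * k1)) * log2(M / k1) + 3 * h


def argmin_on_grid(M, N, p, step=0.01):
    """Scan k1 over [1, M] and return (k1, cost) at the smallest grid value of (15)."""
    best_k1, best_cost = None, None
    for t in range(int(round((M - 1) / step)) + 1):
        k1 = 1 + t * step
        value = cost_15(k1, M, N, p)
        if best_cost is None or value < best_cost:
            best_k1, best_cost = k1, value
    return best_k1, best_cost
```

For M = 64, N = 128 and p = 64 the scan settles near $k_1 = 8$, where $k_2 = MN/(pk_1) = 16 = 2k_1$.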


9. Concluding Remarks 


The simplex method is a well-known solution method for 
the linear programming problem. In this paper, we have 
proposed an implementation of the simplex method on 
fixed-size hypercubes. Two cases, pipelined broadcastings 
allowed and pipelined broadcastings not allowed, were 
considered. For both cases, the optimal speedup can be 
achieved asymptotically. When the problem instance sizes are 
considered finite, we derived two sufficient conditions for 
optimal partitionings. It is difficult to obtain optimal 
partitionings from these two conditions. However, we think 
that suboptimal partitionings can be obtained with the aid of 
them. Besides, we have obtained optimal partitionings for 
some special cases. Although the simplex method can also be
implemented on other parallel architectures [2],[10], those
implementations cannot achieve the asymptotically optimal
speedup. Besides, the numbers of processors they use depend on
the problem instance sizes.


Since the linear programming problem we considered in
this paper is in standard form, a basic feasible solution must be
provided initially. This initial feasible solution can be obtained
by using the two-phase technique [11]. In the proposed
implementation, the processor with address (0,0,...,0,1,...,1)
($b_h = b_{h-1} = \cdots = b_{n+1} = 0$ and $b_n = \cdots = b_1 = 1$) has a heavier
computation load than the others. If its computation load can be
shared by the other processors, then the number of
computation steps required for each iteration can be further
reduced. This is not difficult to do. We can simply let
$M = k_1 2^m - 1$ and $N = k_2 2^n - 1$ and reduce the sizes of the $A_{0,j}$'s
and $A_{i,(2^n-1)}$'s, where $0 \le j \le 2^n-1$ and $0 \le i \le 2^m-1$, to
$(k_1-1) \times (k_2-1)$.

Fig. 1. The partitionings of (a) A, (b) d, and (c) c.

Abstract -- Parallel computations are presented for stochastic
dynamic programming problems arising from the optimal control of 
nonlinear, continuous time dynamical systems, perturbed by Poisson 
as well as Gaussian random white noise. The numerical formulation 
is highly suitable for a vector multiprocessor. Column-oriented code 
can run more than twice as fast as row-oriented code for highly 
refined meshes, but row-oriented code is more efficient and practical 
for coarser meshes. Advanced computing techniques and hardware 
help alleviate Bellman’s curse of dimensionality in dynamic 
programming computations. 


1. Problem Summary 


The governing Markov dynamical system is the stochastic
differential equation (SDE):

$dy(s) = F(y,s,u)\,ds + G(y,s)\,dW(s) + H(y,s)\,dP(s)$,    (1.1)
$y(t) = x$;  $0 < t < s < t_f$;  $y(s) \in D_x$;  $u \in D_u$,

where y(s) is the mx1 state vector at time s starting at time t,
u = u(y,s) is the mx1 feedback control vector, W is the
r-dimensional normalized Gaussian white noise vector, P is the
independent q-dimensional Poisson white noise vector with jump rate
vector $\lambda$, F is the nx1 deterministic nonlinearity vector, G is an
nxr diffusion coefficient array, and H is an nxq Poisson amplitude
coefficient array. Some applications are models of resources in an
uncertain environment [8], [10], [7], and flight dynamics under
random wind conditions [2].


The control criterion is the optimal expected cost performance,

$V^*(x,t) = \min_{u} \left[ \mathrm{MEAN}_{P,W} \left[ V[y,s,u,P,W] \mid y(t) = x \right] \right]$,
$V[y,t,u,P,W] = \int_t^{t_f} ds\, C(y(s),s,u(y(s),s))$,    (1.2)

on the time horizon $(t, t_f)$, where the instantaneous cost function
C = C(x,t,u) is assumed to be a quadratic function of the control,

$C(x,t,u) = C_0(x,t) + C_1^T(x,t)u + \tfrac{1}{2}u^T C_2(x,t)u$.    (1.3)

The unit cost of the control increases with u when $C_2$ is positive
definite. In addition, the dynamics in (1.1) are assumed to be linear
in the controls,

$F(x,t,u) = F_0(x,t) + F_1(x,t)u$,    (1.4)

remaining nonlinear in the state variable x.

The Bellman functional PDE of dynamic programming,

$0 = V^*_t + L[V^*] = V^*_t + F_0^T \nabla V^* + \tfrac{1}{2} GG^T(x,t) : \nabla\nabla^T V^*$
$\quad + \sum_{l=1}^{q} \lambda_l \left[ V^*(x + H_l(x,t), t) - V^*(x,t) \right] + C_0 + (\tfrac{1}{2}U^* - U_R)^T C_2 U^*$,    (1.5)

follows from the generalized Itô chain rule for Markov SDEs as in
[6] and [10], where $U^*$ is the optimal feedback control computed by
constraining the unconstrained or regular control,

$U_R(x,t) = -C_2^{-1}(C_1 + F_1^T \nabla V^*)$,    (1.6)

to the control set $D_u$. In general, the PDE (1.5) is nonlinear with
discontinuous coefficients.


As the number of state variables, m, increases, the spatial 
dimension rises, and computational difficulties are present that can 
compare to those of three-dimensional fluid dynamics computations. 
This is the famous Bellman’s curse of dimensionality [3]. Thus there 
is a great need to make use of advanced-architecture computers, to 
use parallelization as well as vectorization. 


2. Numerical Summary 


The integration of the PDE in (1.5) is backward in time,
because $V^*$ is specified at the final time $t = t_f$, rather than at
the initial time. A summary of the discretization in state and
backward time is given below:

$x \to X_j = [X_{i,j_i}]_{m \times 1} = [X_{i,1} + (j_i - 1)\,DX_i]_{m \times 1}$,
$j = [j_i]_{m \times 1}$, where $j_i = 1$ to $M_i$, for $i = 1$ to m;
$s \to T_k = t_f - (k-1)\,DT$, for k = 1 to K;
$V^*(X_j, T_k) \to V_{j,k}$;  $L[V^*](X_j, T_{k+1/2}) \to L_{j,k+1/2}$;

where $DX_i$ is the mesh size for state i and DT is the step size in
backward time.


The numerical algorithm is a modification of the
predictor-corrector Crank-Nicolson methods for nonlinear parabolic
PDEs in [5]. Modifications are made for the switch term and delay term
calculations. Derivatives were approximated with an accuracy that 
was second order in the local truncation error. Variations of this 
algorithm have been successfully utilized in [10] and [7]. 


Prior to calculating the values, $V_{j,k+1}$, at the new (k+1)st time
step for k = 1 to K-1, the old values, $V_{j,k}$ and $V_{j,k-1}$, are assumed to
be known, with $V_{j,0} = V_{j,1}$. The algorithm begins with an extrapolator
(x) start:

(2.1)

$V^{(x)}_{j,k+1/2} = \tfrac{1}{2}(3V_{j,k} - V_{j,k-1})$.    (2.2)

These evaluations are used in the extrapolated predictor (xp) step:

$V^{(xp)}_{j,k+1} = V_{j,k} + DT \cdot L^{(x)}_{j,k+1/2}$,    (2.3)

which are then used in the predictor evaluation (xpe) step:

$V^{(xpe)}_{j,k+1/2} = \tfrac{1}{2}(V^{(xp)}_{j,k+1} + V_{j,k})$.    (2.4)

The evaluated predictions are used in the corrector (xpec) step:

$V^{(xpec,\gamma+1)}_{j,k+1} = V_{j,k} + DT \cdot L^{(xpece,\gamma)}_{j,k+1/2}$    (2.5)

for $\gamma = 0$ to $\gamma_{max}$ while the stopping criterion is unmet, with
corrector evaluation (xpece) step:

$V^{(xpece,\gamma+1)}_{j,k+1/2} = \tfrac{1}{2}(V^{(xpec,\gamma+1)}_{j,k+1} + V_{j,k})$.    (2.6)


The stopping criterion for the corrections is a heuristically motivated 
comparison to a predictor-corrector convergence criterion for a 
linearized, constant coefficient PDE [9]. 
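
The structure of one time step can be illustrated on a scalar model problem. The sketch below is an assumption-laden reading of steps (2.2) to (2.6): the extrapolated value is the half-step estimate (3V_k - V_{k-1})/2, every evaluation is the average of the new and old values, and the stopping test is a plain tolerance on successive corrections rather than the heuristic criterion of [9]. The operator `L` is a stand-in for the discretized spatial operator.

```python
from math import exp


def predictor_corrector_step(v_k, v_km1, L, dt, max_corrections=10, tol=1e-12):
    """Advance one backward-time step of dV/dT = L(V) (scalar model problem)."""
    v_half = 0.5 * (3.0 * v_k - v_km1)      # extrapolator (x) start, cf. (2.2)
    v_new = v_k + dt * L(v_half)            # extrapolated predictor (xp), cf. (2.3)
    v_half = 0.5 * (v_new + v_k)            # predictor evaluation (xpe), cf. (2.4)
    for _ in range(max_corrections):
        v_corr = v_k + dt * L(v_half)       # corrector (xpec), cf. (2.5)
        v_half = 0.5 * (v_corr + v_k)       # corrector evaluation (xpece), cf. (2.6)
        converged = abs(v_corr - v_new) <= tol   # assumed stopping test
        v_new = v_corr
        if converged:
            break
    return v_new


def integrate(v_start, L, dt, steps):
    """Repeat the step, starting with V_0 = V_1 as in the text."""
    v_km1 = v_k = v_start
    for _ in range(steps):
        v_km1, v_k = v_k, predictor_corrector_step(v_k, v_km1, L, dt)
    return v_k
```

On the test operator L(V) = -V with DT = 0.01, a hundred steps starting from 1 land close to exp(-1), as expected of a second-order, Crank-Nicolson-like scheme.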


Parallelization and vectorization of the algorithm was done by
what might be called the "Machine Computational Model Method",
i.e., tuning the code to optimizable constructs that are automatically
recognized by the compiler, with the Alliant FX/8 vector
multiprocessor [1] in mind. All inner double loops were reordered to
fit the Alliant concurrent outer - vector inner (COVI) model. All
non-short single loops were made vector-concurrent. Short loops
became scalar-concurrent only. Multiple nested loops were
reordered with the two largest loops innermost. A total of 37 out of
39 loops was optimized.

Dongarra, Gustavson, and Karp [4] have demonstrated that loop
reordering gives vector or supervector performance for linear algebra
loops on a CRAY-1 type column-oriented FORTRAN environment
with vector registers. Here performance measurements have been
made with both column-oriented loop code and row-oriented loop
code, in order to make a comparison for our particular problem and
machine environment. In the column-oriented loop, the most inner
loop iterates on the first subscript, ..., and the most outer loop
iterates on the last subscript, as in the following code fragment that
is the main not inner-COVI loop from the corrector step (2.5):

      do 21 l=1,m
      do 21 j2=1,M2
      do 21 j1=1,M1
         ss(j1,j2,l)=F1(j1,j2,l)*DV(j1,j2,l)
     &      + GSQ(j1,j2,l)*DDV(j1,j2,l)
     &      + lambda(l)*(HV(j1,j2,l) - VM(j1,j2))
 21   continue

where m = 2 = n, with the array subscripts and loop nesting ordered so
that the major part of the finite difference work is done in the two
most inner loops. In the row-oriented loop the inner two subscripts
are iterated in reverse order as follows:

      do 22 l=1,m
      do 22 j1=1,M1
      do 22 j2=1,M2
         ss(l,j1,j2)=F1(l,j1,j2)*DV(l,j1,j2)
     &      + GSQ(l,j1,j2)*DDV(l,j1,j2)
     &      + lambda(l)*(HV(l,j1,j2) - VM(j1,j2))
 22   continue

The row orientation in this do 22 loop refers to an array slice with
the state counter l fixed. In the two above code fragments,
GSQ = ½G'G and VM = V at $T_{k+1/2}$.

The advantages of the algorithm are that it 1) permits the
treatment of general continuous time Markov noise or deterministic
problems without noise in the same code, 2) permits the cheap
control limit to linear singular control to be found from the same
quadratic cost code, and 3) produces very vectorizable and
parallelizable code whose performance is described in the next
section.


3. Results Summary


The vector multiprocessor used for our performance
measurements was the Alliant FX/8 in the Advanced Computing
Research Facility (ACRF) at Argonne National Laboratory. This
Alliant FX/8 has 8 vector Computing Elements (CEs). Each of the
CEs has eight vector registers whose length is 32 eight-byte
elements, and the CEs are connected to a 128 KB cache.

The two-state and two-control resource model in [7] was used
as the test example. The two controls represent removals from the
system by respective commercial and recreational users of the
system. In this test example, only Poisson noise was used for both
state populations.

Figure 1 shows the dependence of the user or CPU time on the
common number of mesh points, $M_1 = M = M_2$, in a log-log plot, for
the column-oriented 2-state and 2-control code. K = 4(M-1)+1 to
maintain stability properties. When M is between 16 and 64, the
user time $T(p;M;col) = O(M^3)$, because the slope is quite close to
three. This corresponds to a COVI loop dominated limit with m = 2
states and K = O(M) time steps. However, when M is 96 or greater,
$T(p;M) = O(M^4)$, implying that additional overhead from the
memory hierarchy is present that is equivalent to an extra state
dimension ($m \to m+1$). This change is reflected in the assembler
code for the do 21 loop for $M \ge 65$.

[Figure 1. Column Timings vs. M. Log-log plot versus M, Number of
Mesh Points; legend: p = 1, 2, 4, and 8 CEs.]

In Figure 2, the dependence of the user execution time for
row-oriented code is plotted versus the common mesh size M. The
transition between $T(p;M;row) = O(M^3)$ to $O(M^4)$ overhead occurs
between M = 126 and M = 127, but is much sharper than in the
column-oriented Figure 1. The assembler encoding of the do 22 loop
exactly reflects this transition, even for pure double loops. The extra
overhead in the $O(M^4)$ region is nominally due to the fact that
multiples of 32 beyond 64 are not treated as multiples by the Alliant
compiler.

[Figure 2. Row Timings vs. M. Stochastic Dynamic Programming,
ANL ACRF Alliant FX/8. Log-log plot of T(p;M;row), User Time
(seconds), versus M, Number of Mesh Points; legend: p = 1, 2, 4,
and 8 CEs.]
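
FORTRAN stores arrays in column-major order, so the first subscript is the one that is contiguous in memory. The sketch below computes the memory stride seen by the innermost loop of the do 21 and do 22 fragments; it is offered only as background on the two orientations and makes no claim about how the Alliant compiler or cache actually behaves. The helper names are illustrative.

```python
def column_major_offset(subscripts, extents):
    """0-based element offset of a 1-based FORTRAN subscript tuple.

    The first subscript varies fastest (column-major storage).
    """
    offset, stride = 0, 1
    for subscript, extent in zip(subscripts, extents):
        offset += (subscript - 1) * stride
        stride *= extent
    return offset


def innermost_stride(extents, inner_axis):
    """Distance in elements between consecutive iterations of the loop
    that runs over the subscript at position inner_axis."""
    stride = 1
    for extent in extents[:inner_axis]:
        stride *= extent
    return stride
```

For M1 = M2 = 64 and m = 2, the column-oriented array ss(j1,j2,l) is walked with stride 1 by its innermost j1 loop, whereas the row-oriented array ss(l,j1,j2) is walked with stride m*M1 = 128 by its innermost j2 loop.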


In Figure 3, the ratio of the row-oriented to column-oriented
user execution time is displayed in a 3-D representation versus both
M and the number of processors p. The effect of the number of
processors p on the ratio of the timings is much weaker than that of
the mesh size M, because the array orientation is primarily a memory
problem. While the row-oriented version is slightly faster for
M < 64, the column-oriented version is up to two times faster for
finer meshes. Hence the column advantage in this FORTRAN
environment is present only for the finer meshes with this stochastic
dynamic programming code and the Alliant FX/8. Unfortunately, this
advantage occurs in a region of extra overhead, where the timings
are $O(M^4)$.


[Figure 3. Row to Column Ratio vs. p & M. Stochastic Dynamic
Programming, ANL ACRF Alliant FX/8.]


4. Conclusions


Stochastic dynamic programming is practical for several to a
moderate number of state variables using a vector multiprocessor. In
order to handle a large number of state variables, a large number of
parallel processors would be desirable, but Bellman's curse of
dimensionality appears to be very much weakened. These techniques
are generally applicable to other vector and parallel computers, e.g.,
the CRAY X-MP, as applied to the most inner loops. Row-oriented
codes can be very competitive with column-oriented codes for small
to moderate mesh sizes (M < 64), but are up to twice as slow for
finer meshes.


Abstract


Fast parallel algorithms for the computation and evaluation
of interpolating polynomials in the Lagrangian basis are
presented. On n+1 data points, the interpolation algorithm
requires $\lceil \log n \rceil + 2$ parallel arithmetic steps and n(n+1)
processors. The results are compared with Egecioglu, Gallopoulos
and Koc's [2] parallel Newtonian algorithm, and an
extension to their work is presented. The algorithms discussed
are suitable for a shared memory PRAM architecture.


Introduction 
Given a set of n+1 pairs of real values, $(x_i, f_i)$ for
$i = 0,1,\ldots,n$ with distinct $x_i$'s, there exists a unique polynomial
$p_n(x)$ of degree at most n such that $p_n(x_i) = f_i$ for
$i = 0,1,\ldots,n$. This interpolating polynomial $p_n(x)$ can be written
in the Newtonian form

$p_n(x) = \sum_{i=0}^{n} f_{0,i}\,(x-x_0)(x-x_1)\cdots(x-x_{i-1})$    (1)

in which the coefficients $f_{0,i}$ are the divided differences of f.
Alternatively, $p_n(x)$ can be written in the Lagrangian form

$p_n(x) = \sum_{i=0}^{n} l_i f_i\,(x-x_0)\cdots(x-x_{i-1})(x-x_{i+1})\cdots(x-x_n)$    (2)

where $l_i$ is the real factor of the Lagrange polynomial $l_i(x)$.


Various forms of polynomial interpolation have been studied
in the context of both numerical analysis and computational
complexity. However, Egecioglu, Gallopoulos and
Koc [2] seem to be the first to study the problem in terms of
the Newtonian form with respect to parallel solution. In fact,
the typical form discussed is the classical form

$p_n(x) = \sum_{i=0}^{n} p_i x^i$.    (3)

Equations (1), (2) and (3) represent the same polynomial
function under different bases. The usefulness of a particular
basis depends on the application. Divided differences in particular
are useful for the evaluation of derivatives and
integrals. The Lagrangian form has its own benefits, including
ease of derivation as will be shown, without adversely
affecting such properties as permanence and evaluation.


Egecioglu et al. remark that fast interpolation algorithms
(those with time complexity less than $O(n^2)$) are still based
on slow serial schemes which require $O(n^2)$ operations (in the
sequential machine). Because of the increasing availability of
parallel systems, they suggest that existing methods are
impractical because: 1) the constant multiplier of the order
tends to be relatively large and 2) the current fast interpolation
algorithms are subject to significant roundoff errors when
implemented in finite precision arithmetic.

The first reason suggests that the size of the problem must
be sufficiently large to make these algorithms competitive.
However, polynomial interpolation is not usually recommended
on a large number of data points. The second reason
is self-explanatory: why use fast algorithms if they are
numerically unstable? Factors which affect both these areas
are requirements such as equidistant points or the use of
convolution techniques (e.g. fast Fourier transforms).

The reader should note that over some fields, polynomial
interpolation of a large number of data points is desirable.
Modular techniques (cf. McClellan [5]) can be used to solve
large problems efficiently. In this respect the Chinese
Remainder Theorem often occurs, and Zhang, Shirazi and
Yun [8] have recently shown that the CRT problem can be
solved in O(log n) parallel steps (a). This result is applicable to
modular techniques and may affect the complexity of parallel
algorithms for large polynomial interpolation problems.


In an attempt to address the concerns enumerated above,
Egecioglu et al. present a fast (O(log n)) and practical parallel
algorithm for the computation of interpolating polynomials
(in Newtonian form). By practical they intend that the
proposed algorithm be numerically stable and implementable
in floating-point arithmetic with error accumulation
similar to that of stable serial algorithms. Recently, Reif [6]
showed that polynomial interpolation of n points in the classical
form (3) can be performed in O(log n) time complexity, with a
circuit of size (b) $O(n^2 \log n)$. As the method makes use of
discrete Fourier transforms the constant multiplier in the time
complexity for this algorithm is greater than 4. Thus, both
Egecioglu et al.'s results and those presented in this paper
can be implemented with a circuit of smaller size and depth
and without the use of convolution techniques.


In this paper two separate problems are investigated. The
first involves extending Egecioglu et al.'s results to include
the permanence property of the Newtonian form. The second
problem is to show that the Lagrangian form can be interpolated
and evaluated in a manner competitive with Egecioglu
et al.'s results.


The algorithms discussed may be implemented on a 
shared memory PRAM architecture. Borrowing the descrip- 
tion from Kruskal, Rudolph and Snir [3] the PRAM computa- 
tional model consists of p autonomous processors, executing 
synchronously, all having access to a common shared 
memory. The processors execute in SIMD fashion, and 
memory access is assumed to be accomplished in one cycle. 


Parallel Newtonian Interpolation 

The success of Egecioglu et al.'s algorithm depends on
two key observations: 1) computing the divided differences
in an alternative form and 2) using the parallel prefix computation
(see Kruskal, Rudolph and Snir [3] and Ladner and
Fischer [4]) for the evaluation of the divided differences.

Once the divided differences are computed the Newtonian
representation of $p_n(x)$ is completely defined. Typically, the
divided differences are computed by

$f_{i,j} = \frac{f_{i,j-1} - f_{i+1,j}}{x_i - x_j}$    (4)

where $f_{i,i} = f_i$ and $0 \le i < j \le n$. $f_{0,0}, f_{0,1}, \ldots, f_{0,n}$ are the
divided differences required by the Newtonian form. Note
that from (4) the $f_{i,j}$'s for a particular j-i value can be calculated
independently of each other, and depend only on the
divided differences calculated for the j-i-1 value and the
$x_i$'s. This yields a straightforward parallel algorithm where
the divided differences for a particular j-i value are computed
in parallel. Thus O(n) operations using O(n) processors
are required to compute all the divided differences.

(a) All logarithms are base 2.
(b) Size refers to the number of productive nodes in a circuit.
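
The level-by-level scheme based on (4) can be sketched as follows. Each pass of the outer loop handles one value of j-i, and the entries inside a pass are independent, so they could be assigned to separate processors; the sequential list comprehension merely stands in for that parallel step. Exact rational arithmetic is used so that the example is free of roundoff; the function names are illustrative.

```python
from fractions import Fraction


def divided_differences(xs, fs):
    """Return [f_{0,0}, f_{0,1}, ..., f_{0,n}] computed level by level from (4)."""
    n = len(xs) - 1
    level = [Fraction(f) for f in fs]          # f_{i,i} = f_i
    coefficients = [level[0]]
    for gap in range(1, n + 1):                # gap = j - i
        # level[i] holds f_{i,i+gap-1}; level[i+1] holds f_{i+1,i+gap}.
        level = [(level[i] - level[i + 1]) / (xs[i] - xs[i + gap])
                 for i in range(n + 1 - gap)]
        coefficients.append(level[0])
    return coefficients


def newton_eval(xs, coefficients, x):
    """Evaluate the Newtonian form (1) by nested multiplication."""
    result = Fraction(0)
    for k in reversed(range(len(coefficients))):
        result = result * (x - xs[k]) + coefficients[k]
    return result
```

For the points (0,0), (1,1), (2,8), (3,27) taken from x^3, the coefficients are 0, 1, 3, 1 and the Newtonian form reproduces x^3 at any other argument.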


Egecioglu et al. present an alternative method that
requires $O(n^2)$ processors and solves the problem in O(log n)
steps. Let $y_{i,j} = x_i - x_j$ for $i \ne j$ and $y_{i,i} = 1$, $i,j = 0,1,\ldots,n$.
Then the kth ($n \ge k \ge 0$) divided difference of f ($f_{0,k}$) can be
expressed as a linear combination of the $f_i$'s and the products
of the $y_{i,j}$'s as follows:

$f_{0,k} = \sum_{j=0}^{k} \frac{f_j}{y_{j,0}\,y_{j,1} \cdots y_{j,k}}$.    (5)


Consider the reciprocals of the coefficients of $f_i$ in
$f_{0,i}, f_{0,i+1}, \ldots, f_{0,n}$ (the coefficient of $f_i$ in $f_{0,j}$ is always
zero for j<i). These are of the form $y_{i,0}y_{i,1} \cdots y_{i,i}$,
$y_{i,0}y_{i,1} \cdots y_{i,i+1}$, ..., $y_{i,0}y_{i,1} \cdots y_{i,n}$. But these are just the
prefixes of $y_{i,0}y_{i,1} \cdots y_{i,n}$ and the parallel prefix algorithm
can generate all of them in $\lceil \log n \rceil$ steps using only n processors
(cf. Kruskal, Rudolph and Snir [3] and Ladner and Fischer
[4]). Ladner and Fischer also show that such a parallel prefix
algorithm may be constructed with a circuit of size O(n).
There are n+1 data points, thus n+1 concurrent instances of
the parallel prefix algorithm are needed to compute the
prefixes of the term $y_{i,0}y_{i,1} \cdots y_{i,n}$ for $0 \le i \le n$. Hence $O(n^2)$
processors are needed to perform this calculation. Note that
$y_{i,i} = 1$ for all i, thus it can be omitted from the prefix
calculations without affecting the results.
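
To make the prefix idea concrete, the sketch below simulates a prefix-product computation by recursive doubling and then assembles the divided differences from (5). Recursive doubling is used here only because it is short to write down; it is not the Ladner and Fischer construction, and it does more total work than the O(n)-size circuit cited above. For simplicity the factor $y_{i,i} = 1$ is kept in each row, so the scan runs over n+1 entries. Function names are illustrative.

```python
from fractions import Fraction


def prefix_products(values):
    """Inclusive prefix products by recursive doubling.

    Returns (prefixes, rounds); each round models one parallel step.
    """
    out = list(values)
    shift, rounds = 1, 0
    while shift < len(out):
        out = [out[i] * out[i - shift] if i >= shift else out[i]
               for i in range(len(out))]
        shift *= 2
        rounds += 1
    return out, rounds


def divided_differences_via_prefix(xs, fs):
    """Return [f_{0,0}, ..., f_{0,n}] from formula (5)."""
    n = len(xs) - 1
    prefixes = []
    for i in range(n + 1):
        row = [Fraction(1) if i == j else Fraction(xs[i] - xs[j])
               for j in range(n + 1)]
        prefixes.append(prefix_products(row)[0])   # prefixes[i][k] = y_{i,0}...y_{i,k}
    return [sum(Fraction(fs[i]) / prefixes[i][k] for i in range(k + 1))
            for k in range(n + 1)]
```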


Egecioglu et al. show that the Newtonian interpolation
polynomial for n+1 points can be computed in less than
$2\lceil \log(n+1) \rceil + 2$ parallel arithmetic steps using n(n+1)
processors and can be implemented as an arithmetic circuit of
size $O(n^2)$. Their assumptions require that all processors be
able to perform any of the four arithmetic operations in one
unit step and that the processors may be reused.


Improving the Parallel Newtonian Algorithm 

The permanence property, or the ability to add a new data
point and obtain the new interpolating polynomial $p_{n+1}(x)$ on
n+2 data points without recalculating all the divided
differences, is an aspect of the Newtonian form that Egecioglu
et al. leave open. From (4) it is clear that the divided
differences $f_{i,j}$ for $0 \le i \le j \le n$ would remain the same (provided
the first n+1 data points are kept in the same sequence), and
that n+2 additional divided differences $f_{j,n+1}$ would have to be
computed for $0 \le j \le n+1$. This assumes that some of the
divided differences are not discarded once $p_n(x)$ is calculated.
As stated, the parallel Newtonian algorithm does not retain
any information once it has calculated the coefficients of
$p_n(x)$; thus, adding one data point requires that the algorithm
execute again with (n+1)(n+2) processors. There is no problem
if that many processors are available, but if not, it would
be desirable to retain the capability of computing $p_{n+1}(x)$ in a
reasonable amount of time. By sacrificing some storage the
number of processors needed for the computation can be
reduced significantly without increasing the computation time
by more than a few arithmetic steps.

Suppose the parallel Newtonian algorithm is executed
again with n+2 data points (in the same sequence as in the
initial run that generated $p_n(x)$, adding the new data point to
the end). Then, from (1) it follows that

$p_{n+1}(x) = p_n(x) + f_{0,n+1}\,(x-x_0)(x-x_1)\cdots(x-x_n)$.

The only difference between $p_{n+1}(x)$ and $p_n(x)$ is in the last
coefficient that is calculated. Now from (5) and with k = n+1
it follows that

$f_{0,n+1} = \sum_{j=0}^{n+1} \frac{f_j}{y_{j,0}\,y_{j,1} \cdots y_{j,n+1}}$.

Provided that the denominators of the coefficients in $f_{0,n}$ have
been previously stored, it is easy to calculate $f_{0,n+1}$.


Theorem 1: The Newtonian interpolating polynomial
for n+2 points can be computed from the Newtonian interpolating
polynomial for n+1 points (where the first n+1 data
points are in the same sequence) in at most $2\lceil \log(n+2) \rceil + 4$
parallel arithmetic steps using n+2 processors.

Proof: First, calculate $y_{i,n+1}$ and $y_{n+1,i}$ for $0 \le i \le n+1$, followed
by $(y_{i,0}y_{i,1} \cdots y_{i,n})\,y_{i,n+1}$ for $0 \le i < n+1$ (recall that the
left-hand factor is available from the previous computation of
$p_n(x)$). This requires two parallel steps and one parallel step,
respectively, using n+2 processors. Now perform a binary
tree multiplication to compute $y_{n+1,0}y_{n+1,1} \cdots y_{n+1,n}y_{n+1,n+1}$
followed by n+2 parallel divisions. This requires n+2 processors
and $\lceil \log(n+2) \rceil + 1$ parallel steps. Finally, use a binary
tree addition algorithm to sum the n+2 elements and obtain
$f_{0,n+1}$. This requires $\lceil \log(n+2) \rceil$ further parallel steps. []
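
The bookkeeping in the proof can be illustrated sequentially. The sketch below assumes that the n+1 stored denominators $y_{i,0} \cdots y_{i,n}$ are kept in a list from the previous run; it extends each by the new factor $y_{i,n+1}$, forms the denominator for the new point, and sums the quotients to obtain $f_{0,n+1}$. The loops stand in for the parallel steps, and the function names are illustrative.

```python
from fractions import Fraction


def initial_denominators(xs):
    """denominators[i] = product over j != i of (x_i - x_j), i.e. y_{i,0}...y_{i,n}."""
    result = []
    for i, xi in enumerate(xs):
        product = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                product *= xi - xj
        result.append(product)
    return result


def add_point(xs, fs, denominators, x_new, f_new):
    """Return (f_{0,n+1}, updated denominators) after appending (x_new, f_new)."""
    # Extend each stored product by y_{i,n+1} = x_i - x_new.
    updated = [denominators[i] * (xs[i] - x_new) for i in range(len(xs))]
    # Denominator for the new point: y_{n+1,0} ... y_{n+1,n} (y_{n+1,n+1} = 1).
    last = Fraction(1)
    for x in xs:
        last *= x_new - x
    updated.append(last)
    values = list(fs) + [f_new]
    coefficient = sum(Fraction(values[i]) / updated[i] for i in range(len(values)))
    return coefficient, updated
```

Starting from the points (0,0), (1,1), (2,8) and adding (3,27) yields the new coefficient 1, which matches the leading divided difference of x^3 on four points.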


Parallel Lagrangian Interpolation 
Egecioglu et al.'s parallel computation of certain features
in the Newtonian basis (1) suggests that similar features in the
Lagrangian form (2) might also be computed in parallel. The
typical formulation of the Lagrangian representation (cf.
Abramowitz and Stegun [1]) is $p_n(x) = \sum_{i=0}^{n} l_i(x) f_i$ where

$l_i(x) = \frac{(x-x_0) \cdots (x-x_{i-1})(x-x_{i+1}) \cdots (x-x_n)}{(x_i-x_0) \cdots (x_i-x_{i-1})(x_i-x_{i+1}) \cdots (x_i-x_n)}$.

Using the notation developed earlier it follows from (2) that

$l_i = \frac{1}{y_{i,0}\,y_{i,1} \cdots y_{i,n}}$    (6)

for $0 \le i \le n$ (recall that $y_{i,i} = 1$). This product is easily computed
in $\lceil \log n \rceil$ parallel steps using n processors. n+1 such
products are required, thus n+1 parallel instances of (6) are
sufficient to calculate all the $l_i$'s. Thus, n(n+1) processors are
needed to compute $p_n(x)$ in the Lagrangian form. The algorithm
is given by the following steps, assuming the data point
pairs are stored in memory, and n(n+1) processors are available:

Parallel Lagrangian Interpolation
1) Compute $y_{i,j} = x_i - x_j$ for $i,j = 0,1,\ldots,n$, $i \ne j$. $y_{i,i} = 1$ is
   already set.

2) Compute $1/l_i = \prod_{j=0}^{n} y_{i,j}$ for $0 \le i \le n$.

3) Compute and store $l_i$ for $0 \le i \le n$.
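
The three steps translate directly into code. The sketch below is sequential: the inner products that the text computes with binary-tree multiplication are written as plain loops, and exact rational arithmetic is used to keep the example free of roundoff. The evaluation routine follows form (2); the function names are illustrative.

```python
from fractions import Fraction


def lagrange_factors(xs):
    """Steps 1 to 3: return [l_0, ..., l_n] with 1/l_i = product over j != i of (x_i - x_j)."""
    factors = []
    for i, xi in enumerate(xs):
        product = Fraction(1)                  # y_{i,i} = 1
        for j, xj in enumerate(xs):
            if j != i:
                product *= xi - xj             # y_{i,j} = x_i - x_j
        factors.append(1 / product)
    return factors


def lagrange_eval(xs, fs, factors, x):
    """Evaluate p_n(x) in the Lagrangian form (2)."""
    total = Fraction(0)
    for i in range(len(xs)):
        term = factors[i] * fs[i]
        for j in range(len(xs)):
            if j != i:
                term *= x - xs[j]
        total += term
    return total
```

For the points (0,0), (1,1), (2,8), (3,27), the stored factors are -1/6, 1/2, -1/2 and 1/6, and evaluating at x = 5 returns 125.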


Theorem 2: The Lagrangian interpolating polynomial 
for n+1 points can be computed in at most [log n | +2 paral- 
lel arithmetic steps using n(n+1) processors and can be imple- 
mented as an arithmetic circuit of size O(n”). 


Proof: Clearly, Step 1 of the above algorithm requires one
parallel step while Step 2 takes $\lceil \log n \rceil$ parallel steps using a
binary tree multiplication algorithm. Finally, Step 3 takes one
step to perform in parallel a division over n+1 processors.
The processors involved are assumed to be reusable and able
to perform the four basic arithmetic operations. In the case of
an arithmetic circuit a binary tree multiplication of n inputs
requires a circuit of depth $\lceil \log n \rceil$ and size n. The division
step requires one operational node and there are n+1 such
divisions. The computation of the $y_{ij}$'s requires n(n+1)
operational nodes. Thus, the parallel Lagrangian algorithm
for n+1 inputs can be performed with a circuit of depth
$\lceil \log n \rceil + 2$ and size $O(n^2)$. □
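
The three steps of the interpolation algorithm and the step count of Theorem 2 can be illustrated with a small serial simulation. This is only a sketch: the function names (`tree_product`, `lagrange_coefficients`, `step_bound`) are hypothetical, the logarithm is assumed to be base 2, and the "parallel steps" are counted as levels of a pairwise product tree rather than measured on real hardware.

```python
import math


def tree_product(values):
    """Multiply values pairwise, level by level; return (product, levels used)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [level[k] * level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element is carried to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def lagrange_coefficients(xs):
    """Return ([l_0, ..., l_n], simulated parallel steps) for n + 1 >= 2 points."""
    n = len(xs) - 1
    coeffs, max_depth = [], 0
    for i in range(n + 1):          # the n + 1 instances would run concurrently
        # Step 1: y_ij = x_i - x_j for j != i (y_ii = 1 contributes nothing)
        y = [xs[i] - xs[j] for j in range(n + 1) if j != i]
        # Step 2: 1 / l_i as a binary tree product of n factors
        inv_l, depth = tree_product(y)
        max_depth = max(max_depth, depth)
        # Step 3: one division
        coeffs.append(1.0 / inv_l)
    return coeffs, 1 + max_depth + 1


def step_bound(n):
    """Bound of Theorem 2, assuming base-2 logarithms."""
    return math.ceil(math.log2(n)) + 2
```

For the points 0, 1, 2, 4 (n = 3) the simulated step count equals the bound of the theorem, and the coefficients are the reciprocals of the products of node differences.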

Why should any of the bases (1), (2) or (3) be used? Why
not go directly to the evaluation step with the data point pairs
$(x_i, f_i)$ using the basic Lagrange formula

$$p_n(x) = \sum_{i=0}^{n} f_i \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j}\,?$$

While this computation could be done in about
$3\lceil \log n \rceil + 3$ parallel arithmetic steps with
n(n+1) processors, it is not competitive since, as is shown
later, n+1 such evaluations could be performed by either the
parallel Newtonian or parallel Lagrangian algorithm in about
the same amount of time.


Still another possibility would be to store the product
$l_i f_i$. This increases the interpolation algorithm by one step
and decreases the evaluation by one step.


The Parallel Lagrangian Algorithm with Fewer Processors 

It has been assumed that there are sufficient processors to
handle the entire problem simultaneously. Suppose there are
a limited number of processors p, where p = m(n+1) and
$1 \le m \le n$ is an integer. All those operations which can be done
in one step using n concurrent instances may also be done in
at most $\lceil n/m \rceil$ steps.


Theorem 3: The parallel Lagrangian algorithm for n+1
data points can be performed in at most
$2\lceil (n+1)/m \rceil + \lceil \log m \rceil$ arithmetic steps when p = m(n+1)
processors are available.
Proof: First, the $y_{ij}$'s can be calculated in $\lceil n/m \rceil$ steps
using p processors for all $i, j = 0, 1, \ldots, n$, $i \ne j$. Now, using
n+1 concurrent instances of a binary tree multiplication algorithm,
n+1 elements can be multiplied together to compute
each $1/l_i$ using m processors in each instance. This is done
by breaking the n+1 elements into m blocks each with at most
$\lceil (n+1)/m \rceil$ elements, and allowing each processor to multiply
these elements together. This requires $\lceil (n+1)/m \rceil - 1$
steps. Now the m blocks can be multiplied together in
$\lceil \log m \rceil$ steps. Finally, n+1 concurrent parallel divisions
yield $l_i$ for $0 \le i \le n$. □
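
The block decomposition used in the proof of Theorem 3 can be checked numerically. The following is a rough sketch under stated assumptions: `blocked_product` and `total_steps` are hypothetical names, logarithms are taken base 2, and one instance with m simulated processors stands in for the n+1 concurrent instances.

```python
import math


def blocked_product(values, m):
    """Multiply `values` with m simulated processors; return (product, steps)."""
    size = math.ceil(len(values) / m)        # at most ceil((n+1)/m) per block
    blocks = [values[k:k + size] for k in range(0, len(values), size)]
    # Phase 1: every processor multiplies its own block serially.
    partial = []
    for block in blocks:
        acc = block[0]
        for v in block[1:]:
            acc *= v
        partial.append(acc)
    serial_steps = size - 1
    # Phase 2: the block products are combined by a binary tree.
    tree_steps = 0
    while len(partial) > 1:
        nxt = [partial[k] * partial[k + 1] for k in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            nxt.append(partial[-1])
        partial = nxt
        tree_steps += 1
    return partial[0], serial_steps + tree_steps


def total_steps(n, m):
    """Differences, blocked product of n + 1 factors, then one division."""
    _, product_steps = blocked_product([1.0] * (n + 1), m)
    return math.ceil(n / m) + product_steps + 1


def theorem3_bound(n, m):
    return 2 * math.ceil((n + 1) / m) + math.ceil(math.log2(m))
```

With n = 7 and m = 3, for example, the simulated count meets the bound exactly; for other values of n and m it stays at or below it.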


If the number of available processors p is less than n+1
then pieces of each instance may be performed in parallel.
However, even the first step of the parallel Lagrangian algorithm,
calculating all the $y_{ij}$'s, requires $\lceil n(n+1)/p \rceil$ parallel
computations. This is likely to be uncompetitive except when p
is fairly large since the serial equivalent to this algorithm is of
complexity $O(n^2)$.

Similar results for the parallel Newtonian algorithm are
shown by Egecioglu et al. and are summarized in Table 1.


Permanence Property for Parallel Lagrangian Interpolation 

As with the Newtonian form, it is possible to calculate the
interpolating polynomial $p_{n+1}(x)$ using $p_n(x)$ in the Lagrangian
form. However, the Lagrangian form requires no additional
storage to perform the computation.

Theorem 4: The parallel Lagrangian interpolating polynomial
for n+2 points can be computed from the Lagrangian
interpolating polynomial for n+1 points (where the first n+1
data points are in the same sequence) in at most
$\lceil \log (n+1) \rceil + 4$ parallel arithmetic steps using n+1 processors.


Proof: The original n+1 $l_i$'s need only be divided by
$x_i - x_{n+1}$, which takes two parallel arithmetic steps using n+1
processors. $l_{n+1}$ can be computed in $\lceil \log (n+1) \rceil + 2$ arithmetic
steps, one to calculate in parallel all the $x_{n+1} - x_i$'s for
$0 \le i \le n$, and the remainder to apply a binary tree multiplication
algorithm to obtain $1/l_{n+1}$ followed by a division to compute
$l_{n+1}$. □
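
The update described in the proof of Theorem 4 can be sketched as follows. The names `direct_coefficients` and `add_point` are illustrative only, and the code is serial; it is meant to show that the updated coefficients agree with a computation from scratch, not to reproduce the parallel step count.

```python
import math


def direct_coefficients(xs):
    """l_i = 1 / prod_{j != i} (x_i - x_j), computed from scratch."""
    coeffs = []
    for i, xi in enumerate(xs):
        inv = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                inv *= xi - xj
        coeffs.append(1.0 / inv)
    return coeffs


def add_point(xs, coeffs, x_new):
    """Extend the coefficients for xs to the point set xs + [x_new]."""
    # Existing coefficients: one subtraction and one division each.
    updated = [l / (x - x_new) for x, l in zip(xs, coeffs)]
    # New coefficient: differences, product, then one division.
    inv = 1.0
    for x in xs:
        inv *= x_new - x
    return xs + [x_new], updated + [1.0 / inv]
```

Starting from the points 0, 1, 2 and appending 4 gives the same coefficients as recomputing all four from the definition.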


Evaluation of the Lagrangian Interpolating Polynomial 
Egecioglu et al. show that the Newtonian interpolating
polynomial of degree n can be evaluated in $2\lceil \log (n+1) \rceil + 2$
parallel arithmetic steps using n processors and can be implemented
as an arithmetic circuit of size O(n). A similar result
for the Lagrangian interpolating polynomial can be shown
and the algorithm is given by the following steps, assuming
the Lagrangian form is stored in memory, and n processors
are available:

Parallel Lagrangian Evaluation

1) Compute the prefixes of $\prod_{j=0}^{n-1} (x - x_j)$ and let $z_{i+1} = \prod_{j=0}^{i} (x - x_j)$
for $0 \le i < n$ and $z_0 = 1$.
2) Compute the prefixes of $\prod_{j=0}^{n-1} (x - x_{n-j})$ and let
$\bar{z}_{i-1} = \prod_{j=0}^{n-i} (x - x_{n-j})$ for $1 \le i \le n$ and $\bar{z}_n = 1$.
3) Compute $l_i f_i z_i \bar{z}_i$ for $0 \le i \le n$.
4) Sum the n+1 terms computed in step 3) and store the
result.


Theorem 5: The Lagrangian interpolating polynomial of
degree n can be evaluated in at most $3\lceil \log n \rceil + 9$ parallel
arithmetic steps using n processors and can be implemented
as an arithmetic circuit of size O(n).
Proof: Equation (2) shows that for $0 \le i \le n$ the product
$(x-x_0)\cdots(x-x_{i-1})(x-x_{i+1})\cdots(x-x_n)$ must be computed. Naively, the entire
product (without omitting the term $x - x_i$) can be computed, and
then the required term can be divided out in n+1 separate
copies. Unfortunately, this introduces a slightly larger error,
as redundant floating point operations are performed for each
i. With a slight increase in complexity the precision of the
data can be preserved.


First, compute the parallel prefixes of the product
$(x-x_0)\cdots(x-x_{n-1})$ and let $z_{i+1} = (x-x_0)\cdots(x-x_i)$ for $0 \le i < n$
with $z_0 = 1$. Then compute the parallel prefixes of the product
$(x-x_n)\cdots(x-x_1)$ and let $\bar{z}_{i-1} = (x-x_n)\cdots(x-x_i)$ for $1 \le i \le n$
with $\bar{z}_n = 1$. Each parallel prefix calculation requires $\lceil \log n \rceil$
steps, and it is easily seen that

$$(x-x_0)\cdots(x-x_{i-1})(x-x_{i+1})\cdots(x-x_n) = z_i \bar{z}_i \quad \text{for } 0 \le i \le n.$$

All the terms $x - x_i$ for $0 \le i \le n$ can be computed in two parallel
steps. Thus, the entire sequence requires n processors to perform
two parallel prefix operations on n inputs followed by n
processors performing three multiplications in parallel (to
compute $l_i f_i z_i \bar{z}_i$ for $0 \le i < n$) and another three for the (n+1)th
term. Finally, a binary addition algorithm can be used to sum
the n+1 terms in $\lceil \log n \rceil + 1$ parallel steps with n processors.

To prove the result for an arithmetic circuit, observe that both
the parallel prefix algorithm and a binary addition (multiplication)
algorithm may be implemented in a circuit with depth
at most $\lceil \log n \rceil$ and size O(n). □
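
The prefix and suffix products used in the evaluation can be made concrete with a short serial sketch. The function names are hypothetical, and the two running products below stand in for the parallel prefix computations of the proof; the point of the example is that no division by $x - x_i$ is ever needed, so evaluation at a node is well defined.

```python
import math


def lagrange_coefficients(xs):
    """l_i = 1 / prod_{j != i} (x_i - x_j)."""
    coeffs = []
    for i, xi in enumerate(xs):
        inv = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                inv *= xi - xj
        coeffs.append(1.0 / inv)
    return coeffs


def evaluate(xs, fs, coeffs, x):
    """Evaluate p_n(x) as the sum of l_i * f_i * z_i * zbar_i."""
    n = len(xs) - 1
    d = [x - xj for xj in xs]
    # z[i] = (x - x_0) ... (x - x_{i-1}), with z[0] = 1
    z = [1.0] * (n + 1)
    for i in range(n):
        z[i + 1] = z[i] * d[i]
    # zbar[i] = (x - x_{i+1}) ... (x - x_n), with zbar[n] = 1
    zbar = [1.0] * (n + 1)
    for i in range(n, 0, -1):
        zbar[i - 1] = zbar[i] * d[i]
    return sum(coeffs[i] * fs[i] * z[i] * zbar[i] for i in range(n + 1))
```

As a usage example, sampling the cubic $x^3 - 2x + 1$ at 0, 1, 2, 4 and evaluating the interpolant at x = 3 reproduces the cubic's value 22, and evaluating at the node x = 2 returns the stored data value.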


It is left to the reader to show that the evaluation can be
done in $2\lceil \log n \rceil + 4$ parallel arithmetic steps if 2(n+1) processors
are available.


Conclusions and Comments 


A fast parallel algorithm for computing the Lagrangian 
interpolating polynomial similar to that for the Newtonian 
interpolating polynomial as shown by Egecioglu et al. has
been developed. It remains to be shown that this algorithm is 
practical (numerically stable). This, however, is left for 
future investigation. Both methods are competitive with each 
other since their input constraints and required number of pro- 
cessors are the same, and their time complexities differ only 
by very small constants for both interpolation and evaluation. 


With regard to the parallel Newtonian algorithm a small, 
but useful change is proposed that allows the use of the per- 
manence property of the Newtonian form. This change only 
increases the constant term in the time complexity of the 
algorithm and doubles the storage requirements from n+1 real 
values to 2(n+1). However, it decreases the required number 
of processors from (n+1)(n+2) to n+2. The permanence pro- 
perty is also demonstrated for the Lagrange form and is 
shown to require about half the time that the Newtonian form 
does, with only n+1 processors. 


The Lagrangian representation also has a property that
may be valuable in certain applications. Once the $l_i$'s have
been calculated, the Lagrangian interpolating polynomial can
be evaluated for different sets of data values $f_i$ without
recalculating the $l_i$'s. While the same result can be obtained from
the parallel Newtonian algorithm by modifying it so that with
a slight increase in its storage requirements it can compute
new divided differences in only $\lceil \log (n+1) \rceil + 1$ steps, it requires
the use of the full n(n+1) processors. Once computed the
evaluation would only take another $2\lceil \log (n+1) \rceil + 2$ steps. Thus,
the whole process would be six arithmetic steps faster than
simply evaluating the Lagrangian interpolating polynomial at
new values $f_i$ with only n processors.


Any problem which requires the computation of a large 
number of interpolating polynomials on a small number of 
points, and then requires the repeated evaluation of these 
polynomials at a large number of test points, without requir- 
ing these polynomials in classical form should benefit greatly 
from either of these algorithms. The nature of the application 
will determine which of the two methods is most suitable. 


While useful as a model, the shared memory PRAM
architecture is not realistic as far as interprocessor communication
is concerned. Egecioglu et al. provide an example
implementation of their algorithm on a cube-connected computer.
Because of the similarities between the two methods it
should be possible to implement the Lagrangian algorithm in
much the same way. A shuffle-exchange architecture might
also be suitable since Schwartz [7] has shown how to perform
the parallel prefix algorithm in O(log n) time.

In terms of circuits the binary addition (or multiplication)
algorithm (circuit) can also be implemented using the parallel
prefix algorithm (circuit). Thus all the circuits described may
be uniformly implemented using just a parallel prefix circuit.
The depth of these circuits remains the same, but the size of
these circuits increases by a constant factor.

Table 1 summarizes the results and compares them with
the corresponding results of Egecioglu et al.


Table 1
Summary of Results (n+1 data points)

Operation             Newtonian (Egecioglu et al.)         Lagrangian
                      processors   steps                   processors   steps
Interpolation         …            …                       n(n+1)       ⌈log n⌉ + 2
                      …            … + 2⌈log m⌉ − 4        m(n+1)       2⌈(n+1)/m⌉ + ⌈log m⌉   (1 ≤ m ≤ n)
Permanence property   n+2          …                       n+1          ⌈log(n+1)⌉ + 4
Evaluation            n            2⌈log(n+1)⌉ + 2         n            3⌈log n⌉ + 9
                      …            …                       2(n+1)       2⌈log n⌉ + 4
Abstract -- This paper presents a new approach to the
implementation of multidimensional (N-D) digital signal 
processing algorithms in a multiprocessor environment. 
We develop a computational primitive for N-D signal pro- 
cessing algorithms using a state space model. We then 
map the state space model onto a linear finite state 
machine and implement the computational primitive in the 
combinational logic block of the linear finite state machine. 
The state variables are stored in the delay elements of 
the machine. With our approach data communications 
requirements among processors have been minimized 
without increasing the computational requirements for the 
given algorithms. We present an efficient multiprocessor 
system using our approach for implementing N-D signal 
Introduction


Extensive research and development have been devoted
to N-D digital signal processing over the last decade
[1,2]. Practical applications include military intelligence,
remote sensing, industrial inspection, robot vision, data
compression for communications, processing of biomedical
images for diagnosis, character recognition, fingerprints,
weather forecasting, etc. These applications are computationally
intensive and require substantial data communications.
In many cases, the reduction of computer hardware
cost makes the design of special purpose computer systems
tailored to the specific requirements of a given class
of algorithms practical [3]. However, the complexity of
most digital signal processing tasks is such that real-time
implementation using a single processor system will not be
feasible in the near future [4]. In this paper, we have concentrated
on the development of algorithms which can be
effectively used for high speed N-D digital signal processing
in a multiprocessor or multicomputer environment. The
data communication requirements for many digital signal
processing algorithms are of the same order of magnitude
as the computational requirements. In particular, the transfer
of a data word between chips in a multiple chip system
can require as much time as a 16 by 16 integer multiply
(typically on the order of 50 to 100 nanoseconds). Thus,
data communication requirements should be given at least
equal consideration in developing algorithms for multiprocessor
systems.


Algorithm Decomposition 


A discrete, linear, shift-invariant (DLSI) system is a
discrete system for which the system parameters do not
vary with changes in the independent variables (time,
space, distance, range, etc.). Many practical digital signal
processing and digital control problems can be represented
as DLSI systems. Our approach is to design computationally
efficient algorithms for N-D DLSI systems which can
be implemented on a multiprocessor system with localized
data communication requirements. We also emphasize the
preference for on-chip data communications compared to
data communications between chips.


In the 1-D case, partial fraction expansion or factorization
can be used to partition the system into an efficient
computational structure for a multiprocessor system. However,
N-D systems cannot be partitioned in this way,
except for the special cases of product or sum separable
systems [5]. An arbitrary multivariate transfer function cannot
be factored into distinct poles and zeros and cannot be
expanded into partial fractions. These approaches to
developing a parallel or cascade implementation of transfer
functions cannot be extended to N-D systems. Thus, we
must explore alternate means of partitioning N-D DLSI systems.


We use a state space representation as a vehicle to aid 
in the decomposition of N-D DLSI systems for implemen- 
tation on a multiprocessor system. The state space 
representation provides the potential for minimizing the 
data communications requirements for a given algorithm 
without increasing computational complexity. Other advan- 
tages of the state space implementation over direct imple- 
mentation include decreased sensitivity to parameter varia- 
tions and improved performance when finite arithmetic is 
used. In order to clearly explain the concepts involved in 
this approach, we first discuss the state space implementa- 
tion of 2-D DLSI systems. We then show that the concepts 
used in this special case can be extended to the N-D case 
(N > 2). 


State Space Implementation of 2-D DLSI Systems 


A set of finite difference equations is one of the forms 
commonly used for representing DLSI systems. A general 
order, causal 2-D DLSI system with quarter plane support 
can be represented by finite difference equations [1] as 
given by equation (1) 


$$g(m,n) = \sum_{j=0}^{L_1} \sum_{k=0}^{L_2} a(j,k)\, f(m-j,n-k) \;-\; \sum_{j=0}^{L_1} \sum_{\substack{k=0 \\ j+k>0}}^{L_2} b(j,k)\, g(m-j,n-k). \qquad (1)$$
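
As a point of reference for the state space formulation that follows, equation (1) can be evaluated directly. The sketch below is not the implementation proposed in this paper; the function name `dlsi_2d`, the nested-list data layout, and the assumption of zero values outside the quarter plane (zero initial conditions) are illustrative choices.

```python
def dlsi_2d(f, a, b):
    """Direct evaluation of the 2-D difference equation.

    f: M x N input (list of rows); a, b: (L1+1) x (L2+1) coefficient
    arrays. b[0][0] is never used because of the j + k > 0 restriction.
    Samples with negative indices are assumed to be zero.
    """
    rows, cols = len(f), len(f[0])
    l1, l2 = len(a) - 1, len(a[0]) - 1
    g = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0.0
            for j in range(min(l1, m) + 1):
                for k in range(min(l2, n) + 1):
                    acc += a[j][k] * f[m - j][n - k]
                    if j + k > 0:
                        acc -= b[j][k] * g[m - j][n - k]
            g[m][n] = acc
    return g
```

Feeding a unit impulse through a first order system with a(0,0) = 1 and b(0,1) = b(1,0) = 0.5 gives the first samples of its impulse response: 1 at the origin, -0.5 at the two neighbours, and 0.5 on the diagonal.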


An input-output relationship between the transform of
the input sequence $F(z_1,z_2)$ and the transform of the output
sequence $G(z_1,z_2)$ can be written as equation (2)

$$G(z_1,z_2) = a(0,0)F(z_1,z_2) + \sum_{j=0}^{L_1} \sum_{\substack{k=0 \\ j+k>0}}^{L_2} \left[ a(j,k)F(z_1,z_2) - b(j,k)G(z_1,z_2) \right] z_1^{-j} z_2^{-k}. \qquad (2)$$


Figure 1 gives a block diagram representation of the
2-D DLSI system partitioned as specified by equation (2).
Note that the number of vertical delays is the same as the
order of the filter in the $z_2$ variable, which is the minimum
possible number. We can obtain the state space representation
by assigning a horizontal state variable to the input of
each of the horizontal delay blocks (associated with the $z_1$ variable)
and by assigning a vertical state variable to each of
the vertical delay blocks (associated with the $z_2$ variable).
For convenience, we define


(3) 


y(ny,n2) = qu(y—1,n2) + rL(ny,n2—1) 


$c_{1jk} = a(j,k) - a(0,0)\,b(j,k); \quad c_{2jk} = -b(j,k).$


Then, the typical vertical state equation for the 2-D DLSI 
system can be represented by 


Ty(My,N2) = Cyyf(my,n2) + Cayy(ny.ng) + qy(ny—1,n2) 


(4) 


+ ry_1(ny,N—1). 


In a similar way, the typical horizontal state variable is 
given by 


qy(M1,M2) = cyyf(ny,ng) + Cayy(ny,M2) + qy(my—1,n9). = G) 
The output equation is given by

$$g(n_1,n_2) = a(0,0)\, f(n_1,n_2) + y(n_1,n_2). \qquad (6)$$


The equation for the vertical state variables as given in 
equation (4) is a computational primitive for the 2-D DLSI 
system since the vertical state variables, the horizontal state 
variables and the output can be mapped into this equation 
with a suitable interchange of variables. 


State Space Implementation of N-D DLSI Systems 


The general multivariable difference equation for the
causal, discrete, linear, shift invariant (DLSI) system with
first section support (the N-D equivalent of quarter plane
support) is given by [1]

$$g(n_1,\ldots,n_N) = \sum_{j_1=0}^{L_1} \cdots \sum_{j_N=0}^{L_N} a(j_1,\ldots,j_N)\, f(n_1-j_1,\ldots,n_N-j_N) \;-\; \sum_{j_1=0}^{L_1} \cdots \sum_{\substack{j_N=0 \\ j_1+\cdots+j_N>0}}^{L_N} b(j_1,\ldots,j_N)\, g(n_1-j_1,\ldots,n_N-j_N). \qquad (7)$$

The input $f(n_1,\ldots,n_N)$ is assumed to be sampled at uniform
intervals in each of the independent variables and
$g(n_1,\ldots,n_N)$ is the corresponding output. The parameters
$a(j_1,\ldots,j_N)$ and $b(j_1,\ldots,j_N)$ are coefficients which determine
the characteristics of the algorithm. Since the coefficients
can take on arbitrary values as appropriate, this equation
can represent many common N-D problems.


We can extend the approach used for the 2-D DLSI
system to obtain a state space representation of the N-D
DLSI system. Using the transform of the input sequence
F(Z) and the transform of the output sequence G(Z), we
can show the input and output relationship as follows:

$$G(Z) = a(0)F(Z) + \sum_{k_1=0}^{L_1} \cdots \sum_{\substack{k_N=0 \\ k_1+\cdots+k_N>0}}^{L_N} \left[ a(K)F(Z) - b(K)G(Z) \right] Z^{-K} \qquad (8)$$

$Z = (z_1,\ldots,z_N)$; $K = (k_1,\ldots,k_N)$; $Z^{-K} = z_1^{-k_1} \cdots z_N^{-k_N}$.


We can extend the block diagram structure presented above
for the 2-D DLSI system to N-D systems and obtain the
desired state space representation by assigning a state variable
to the input of each delay block in each tuple. After
doing this the most complicated section will have a single
delay element in each tuple. Other subsections may be
equivalent or they may be simpler in that one or more of
the delays may be missing. For convenience, we define


N 
y(nj,..-My) = DOKL, (Mo 1,...) (9) 
k=1 


$c_{1K} = a(K) - a(0)\,b(K); \quad c_{2K} = -b(K).$


Then, the typical state equation for the N-D DLSI system is 
given by 


qi 1(0},.-..My) = C7f(n,,...sNy) + Cozy (N},...,N) (10) 
N 
# Oi Gish ls) + D> driGe mes): 
k=1 
ki 
The output equation is given by

$$g(n_1,\ldots,n_N) = a(0,\ldots,0)\, f(n_1,\ldots,n_N) + y(n_1,\ldots,n_N). \qquad (11)$$


Equation (10) can be considered to be a computational 
primitive for the N-D DLSI system since the state variables 
for each tuple and the output can be mapped into it with a 
suitable interchange of the variables. Thus, N-D DLSI sys- 
tems can be implemented by solving equation (10) repeat- 
edly with proper parameter substitutions. Note that equa- 
tion (10) is a generalization of the 2-D computational prim- 
itive as given in equation (4). 


Multiprocessor System for N-D Signal Processing 


We now consider the implementation of DLSI systems
in a multiprocessor environment. We can derive the computational
primitives for N-D signal processing from equation
(10). These computational primitives then form the basis
for the design of special purpose processors for the implementation
of the overall system. These computational
primitives require two multiplications and N+1 additions to
calculate a state variable or an output for an N-D DLSI system.
We can use a tree structure to implement them with
two multipliers and N+1 adders arranged in a pipeline and
parallel fashion as shown in Figure 2. All inputs in Figure
2 use shift registers or queues to hold appropriate data and
system coefficients for the computation. The current N-D
input data $f(n_1,\ldots,n_N)$ is stored in the F register and the temporary
value $y(n_1,\ldots,n_N)$ is stored in the Y register for the
current computation and replaced by new data for the next
computation. The system coefficients $c_{1K}$ and $c_{2K}$ are stored
in $C_1$ and $C_2$, respectively. The coefficients are arbitrary
and can be changed by software. Therefore an arbitrary N-D
DLSI system can be implemented using this structure.
The $Q_k$'s are the queues holding the previous state variables
for the current computation and are updated for each
input. The pipeline will have $\lceil \log_2 (N+4) \rceil$ stages to calculate
a state variable or an output and will produce a result
at every cycle once all the pipeline stages have been filled.
Since the computational primitives depend only upon the
dimensions of the system and not upon the order of the
system nor upon the size of the input data, we can design
programmable special purpose processors to implement the
computational primitives.


There are $(L+1)^N - 1$ state variables to be calculated for
each input data value for an N-D Lth order system. Thus,
the computational requirements are too high for a single 
processor system or for a general purpose multiprocessor 
system to implement the DLSI systems in real-time at typi- 
cal sampling rates. In this paper, we consider a multipro- 
cessor system for the 2-D and 3-D cases. A spatial domain 
digital filter system for image processing will be presented 
as an example of 2-D signal processing and a motion pic- 
ture analysis system will be presented as an example of a 
3-D system. Higher dimensional systems can be developed 
using the same concept and the general computational 
primitive. 


A Multiprocessor System for 2-D Signal Processing


A multiprocessor architecture to implement the spatial 
domain digital filter in real-time has been proposed by Kim 
and Alexander [6]. The proposed system requires ten pro- 
cessors for a second order filter, each of which implements 
the computational primitive as given in equation (4). Each 
processor has two multipliers and three adders arranged 
in a pipeline and parallel fashion and can generate a state 
variable or an output at every cycle. These processors are 
connected in a linear array, as shown in Figure 3, where 
each processor is assigned to a separate row of data and 
the vertical state variables are passed to the next proces- 
sor in the array for use in computing the output for the sub- 
sequent row of data. In this scheme, the processor 
assigned to the first row is available to be assigned the 
eleventh row because it has completed the processing for 
the first row when the eleventh row becomes available. 
Thus the vertical state variables which are output from the 
tenth processor can be inputs to the first processor for use 
in computing the outputs for the eleventh row. 


This scheme can be extended to higher order spatial
domain filters by increasing the number of processors in the
linear array. For example, seventeen processors are
required for a third order filter and twenty six processors
are required for a fourth order filter, etc. This multiprocessor
system is practical for implementing spatial domain filters
in real-time for images with 512 rows by 512 pixels
per row at 30 frames per second. The time interval
between pixels for such an image is 127 nanoseconds if no
allowance is made for latency periods. Thus, a practical
computation cycle time for a system design is 100
nanoseconds. It is currently possible to design 16 by 16
integer multipliers and adders with cycle times significantly
less than 100 nanoseconds [7].
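
The timing and processor figures quoted above can be checked with a few lines of arithmetic. In this sketch `pixel_interval_ns` follows directly from the stated image size and frame rate, while the closed form in `processors_2d` is only an assumption inferred from the three quoted counts (ten, seventeen and twenty six processors for orders two, three and four); the text itself does not state a general formula.

```python
def pixel_interval_ns(rows, cols, frames_per_second):
    """Time available per pixel, in nanoseconds, with no latency allowance."""
    return 1e9 / (rows * cols * frames_per_second)


def processors_2d(order):
    """Assumed pattern (L + 1)**2 + 1 matching the quoted processor counts."""
    return (order + 1) ** 2 + 1
```

For 512 by 512 images at 30 frames per second the interval comes to about 127 ns, which is why a 100 ns computation cycle leaves some margin.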


A Multiprocessor System for 3-D Signal Processing 


The above multiprocessor system for 2-D can be
extended to 3-D signal processing. To implement 3-D
DLSI systems each processor would be required to implement
equations with two multiplications and four additions
in order to update any state variable or to compute the output.
There are eighteen $q_1$ horizontal state variables, six
$q_2$ vertical state variables, two $q_3$ frame state variables
and an output to be computed for each input for a second
order 3-D system. Considering the pipeline fill up time,
twenty nine cycles will be required to calculate and update
the output and all the state variables for a single input
data value for a second order 3-D system. Since outputs are
generated at every 29th cycle, twenty nine processors can
be arranged to achieve the throughput of one output per
cycle for real-time signal processing.
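
The state variable counts quoted for the second order 3-D system (eighteen, six and two) are consistent with the total of $(L+1)^N - 1$ given earlier. The per-dimension closed form used below, $L(L+1)^{N-k}$ for dimension k, is an assumption inferred from those three numbers and is not stated in the text; the function name is likewise hypothetical.

```python
def state_variable_counts(dimensions, order):
    """Assumed count of state variables per dimension k = 1, ..., N."""
    return [order * (order + 1) ** (dimensions - k)
            for k in range(1, dimensions + 1)]
```

Summing the counts for N = 3 and L = 2 gives 26, and adding the output gives the 27 values that must be produced per input before pipeline fill time is taken into account.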


A multiprocessor system for a second order 3-D DLSI
system is given in Figure 4. Each processor is assigned
to a separate row of data and the $q_2$ state variables are
passed to the next processor through Q2BUF. In this
multiprocessor system, the $q_3$ state variables are transferred
between processors through Q3BUFFER with one
frame delay. Q3BUFFER is a first-in-first-out type memory
buffer which could be partitioned by row for the row by
row data processing operation. This multiprocessor system
is practical for implementing 3-D digital filters in real-time
for moving images with 512 rows by 512 pixels per
row changing at 30 frames per second. A higher order system
can be built by adding processors and buffer memories.


Conclusion 


We proposed a new approach to the design of a multi- 
processor system which can implement multidimensional 
DLSI systems in real-time. This approach also has the 
advantage that the complexity and number of computa- 
tions per input does not increase as the size of the input 
data is increased. Thus, very large multidimensional input 
data can be processed in near real-time with such a sys- 
tem. We are currently developing a special purpose digital 
signal processor and an associated multiprocessor system 
for the real-time implementation of 2-D DLSI systems 
using the approach proposed in this paper. 
Figure 1. A block diagram of general order 2-D DLSI system.

Figure 2. Computational primitive for N-D DLSI systems.

Figure 3. A multiprocessor system for a second order 2-D DLSI system.

Figure 4. A multiprocessor system for a second order 3-D DLSI system.
Abstract


This paper presents the implementation and 
analysis of parallel depth-first search on the ring 
architecture. At the heart of the parallel formula- 
tion of depth-first search is a dynamic work distri- 
bution scheme that divides the work between dif- 
ferent processors. The effectiveness of the parallel 
formulation is strongly influenced by the choice of 
the work distribution scheme. In particular, a com- 
monly used work distribution scheme is found to 
give very poor performance on large rings (> 32
processors). We present a new work distribution 
scheme that is better than the work distribution 
scheme used by other researchers, and gives good 
performance even on large rings (128 processors). 
We introduce the concept of iso-efficiency function 
to characterize the effectiveness of different work 
distribution schemes. 


1 Introduction 


Depth-First Search (DFS) is a general technique used in Artificial
Intelligence for solving a variety of problems in planning,
decision making, theorem proving, expert systems, etc. [11,5].
It is also used under the name of backtracking to solve various
combinatorial optimization problems and constraint satisfaction
problems. Execution of a Prolog program can be viewed
as depth-first search of a proof tree. Iterative-Deepening
algorithms perform cost-bounded DFS in successive iterations to
solve discrete optimization problems [4] and theorem proving [12].
A major advantage of the depth-first search strategy is that it
requires very little memory. Since many of the problems solved
by DFS are highly computation intensive, there has been great
interest in developing parallel versions of depth-first search [3,14,6,2].


This paper presents the implementation and analysis of par- 
allel depth-first search on the ring architecture. At the heart of 
the parallel formulation of depth-first search is a dynamic work 
distribution scheme that divides the work between different pro- 
cessors. The effectiveness of the parallel formulation is strongly 
influenced by the choice of the work distribution scheme. In 
particular, a commonly used work distribution scheme
is found to give very poor performance on large rings (> 32
processors). We present a new work distribution scheme that
is better than the work distribution scheme used by other
researchers [2,13,14], and gives good performance even on large
rings (128 processors). The performance is tested by solving the
15-puzzle problem [11]. The ring architecture is embedded on
a 128-node Intel Hypercube. We introduce the concept of
iso-efficiency function (representing the required growth in problem
size with respect to number of processors to maintain the
efficiency) to characterize the effectiveness of different work
distribution schemes. We have also implemented parallel depth-first
search on Hypercube and shared-memory architectures [9,10].
The ring architecture is important because it is simple to con- 
struct and is highly scalable. 

A detailed treatment of depth-first search is given in [11,5]. 
The unit of computation in a search algorithm is the time taken 
for one node expansion. The total time taken by a sequential 
search algorithm is roughly proportional to the total number of 
nodes it expands. Total number of nodes expanded by a search 
algorithm for a particular instance is called the problem size 
W of the instance. If the depth of the search tree is d, then the 
effective-branching factor b is defined as logyW. 


2 A Parallel Formulation of Depth-First Search


We parallelize depth-first search by sharing the work done among 
a number of processors. Each processor searches a disjoint part 
of the search space in a depth-first fashion. When a processor 
has finished searching its part of the search space, it tries to get 
an unsearched part of the search space from other processors. 
When a solution is found, all of them quit. If the search space 
is finite and has no solutions, then eventually all the processors 
would run out of work, and the (parallel) search will terminate. 
Since each processor searches the space in a depth-first man- 
ner, the (part of) state-space to be searched is easily represented 
by a stack. The depth of the stack is the depth of the currently 
explored node, each level of the stack keeps track of untried 
alternatives. Each processor maintains its own local stack on 
which it executes DFS. When the local stack is empty, it sends 
a request for work to another processor. Each processor peri- 
odically checks for incoming work requests. If it has untried 
alternatives in the stack, then it sends some of them to the re- 
questing processor;! otherwise it sends a null message back. In 
our implementation, at the start of each iteration, all the search 
space is given to one processor, and other processors are given 
null space (i.e., null stacks). From then on, the state-space is 
divided and distributed among various processors. A detailed 
treatment of our parallel formulation can be found in [9,10]. 
In the first formulation we implemented, an idle processor 
requests work only from an immediate neighbor. This is 
a simple and natural scheme and has been used by many 
researchers (see Section 5). Other work distribution schemes are 
possible, and will be considered later. 


¹ It is important to make sure that the work given out is not too 
small (otherwise the requesting processor will soon be out of work 
again) or too large (otherwise the donor processor will soon be out 
of work). The best strategy is to try to give nearly half of the 
local work. 
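
As an illustration of the stack splitting described in this section, the following Python sketch represents a local DFS stack as a list of levels, each holding the untried alternatives at that depth. The function names and the policy of handing over every second alternative of each level (so that roughly half of the work is given away) are assumptions made for this example; they are not taken from the implementation reported here.

```python
def split_stack(stack):
    """Split a DFS stack between a donor and a requester.

    `stack` is a list of levels; each level is a list of untried
    alternatives. The requester receives every second alternative of
    each level, i.e. roughly half of the untried work (assumed policy).
    """
    donor, requester = [], []
    for level in stack:
        donor.append(level[0::2])      # donor keeps alternatives 0, 2, 4, ...
        requester.append(level[1::2])  # requester gets alternatives 1, 3, 5, ...
    return donor, requester


def has_work(stack):
    """A stack holds work if any level still has an untried alternative."""
    return any(level for level in stack)


def handle_request(stack):
    """Answer a work request.

    Returns (new_donor_stack, reply), where reply is a stack for the
    requester, or None to stand for the null message.
    """
    if not has_work(stack):
        return stack, None
    donor, requester = split_stack(stack)
    if not has_work(requester):
        return stack, None  # only single alternatives left, nothing to give
    return donor, requester
```

A processor whose reply is `None` would simply repeat its request later, as in the scheme described above.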


3 Performance of Parallel DFS 


To test the effectiveness of Parallel DFS, we have used it to solve 
the 15-puzzle problem [11]. The 15-puzzle is a 4x4 square tray 
in which are placed 15 square tiles. The remaining sixteenth 
square is uncovered. Each tile has a number on it. A tile that 
is adjacent to the blank space can be slid into that space. A 
game consists of a starting position and a specified goal position. 
The goal is to transform the starting position into the goal 
position by sliding the tiles around. The 15-puzzle problem is 
particularly suited for testing the effectiveness of parallel DFS, 
as it is possible to create search spaces of different sizes (W) by 
choosing appropriate starting configurations. IDA* is the best 
known sequential algorithm to find optimal solutions for the 
15-puzzle problem [4]. It is much faster than simple depth-first 
search, as it is able to use the Manhattan distance heuristic 
[11] to focus the search. Since each iteration of IDA* is a 
cost-bounded depth-first search, a parallel formulation of IDA* is 
easily obtained. 


We implemented parallel cost-bounded depth-first search 
(i.e., the last iteration of IDA*) to find all optimal solutions 
of the 15-puzzle problem on a 1-ring and a 2-ring embedded in 
an Intel Hypercube. On a 1-ring (i.e., a unidirectional ring), 
a processor could ask for work from only one neighbor. On a 
2-ring, a processor could ask for work from both of its 
neighbors. We ran our algorithm on a number of problem instances 
given in Korf's paper [4]. As shown in Fig. 1, we are able to 
get linear speedup up to 16 processors, but for more processors, 
the performance is not very good. The performance of a 2-ring 
is better than that of a 1-ring, but the maximum speedup obtained 
is only 25 even on 128 processors. In general, for a given number 
of processors, we get more speedup for bigger problems and less 
speedup for smaller problems. The size of a problem is 
determined by its sequential execution time. The average execution 
time of the problems for which the speedups of Fig. 3 were 
obtained is roughly 200 minutes. On smaller problems (sequential 
execution time 16 minutes), the maximum speedup for 128 
processors on a 2-ring is approximately 10. But even for very 
large problems, we were not able to get speedups significantly 
higher than 25. It seems that parallel depth-first search with 
the simple work distribution scheme is incapable of effectively 
utilizing larger rings. The next section presents an analysis of 
this scheme which explains this poor performance. 
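
To make the cost-bounded search concrete, the sketch below shows one iteration of IDA* on the 15-puzzle with the Manhattan distance heuristic. It is a minimal sequential illustration only: the goal layout `GOAL`, the tuple encoding of a position (0 marks the blank) and all function names are assumptions of this example, and the code is not the implementation measured in this section.

```python
GOAL = tuple(range(1, 16)) + (0,)  # assumed goal layout; 0 marks the blank


def manhattan(state, goal=GOAL):
    """Sum over tiles of |row - goal_row| + |col - goal_col| (blank excluded)."""
    goal_pos = {tile: divmod(idx, 4) for idx, tile in enumerate(goal)}
    total = 0
    for idx, tile in enumerate(state):
        if tile != 0:
            row, col = divmod(idx, 4)
            goal_row, goal_col = goal_pos[tile]
            total += abs(row - goal_row) + abs(col - goal_col)
    return total


def neighbors(state):
    """Yield the positions reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 4)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        n_row, n_col = row + d_row, col + d_col
        if 0 <= n_row < 4 and 0 <= n_col < 4:
            cells = list(state)
            other = n_row * 4 + n_col
            cells[blank], cells[other] = cells[other], cells[blank]
            yield tuple(cells)


def cost_bounded_dfs(state, bound, g=0, parent=None):
    """Search below `state` with cost bound `bound`.

    Returns (solutions_found, nodes_expanded). A node is pruned when
    g + h exceeds the bound, as in one iteration of IDA*.
    """
    if g + manhattan(state) > bound:
        return 0, 0
    if state == GOAL:
        return 1, 0
    solutions, expanded = 0, 1
    for nxt in neighbors(state):
        if nxt == parent:  # do not undo the previous move
            continue
        found, count = cost_bounded_dfs(nxt, bound, g + 1, state)
        solutions += found
        expanded += count
    return solutions, expanded
```

The number of expanded nodes returned by the last iteration corresponds to the problem size W used in the analysis.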


4 Analysis of Performance 


In this section we analyze the performance of parallel cost- 
bounded DFS. We assume that the effective branching factor of 
the cost-bounded search space is greater than 1 + ε (where ε is 
a positive constant). To avoid speedup anomalies, we assume 
that both sequential and parallel DFS search the whole 
cost-bounded space for all solutions. All these conditions are met 
by the parallel formulation presented in Section 3. 

Figure 1: Average speedup vs Number of processors for 
parallel cost-bounded depth-first search on a ring 
embedded in Intel Hypercube. Sequential Exec. time ~ 10500 
secs, problem size ~ 9 M nodes 


4.1 Definitions and Terminology 


1. Running time $T_N$: is the execution time on N processors. 
$T_1$ is the sequential execution time. 


2. Computation time $T_{calc}$: is the sum of the time spent 
by all the processors in useful computation. Since both 
sequential and parallel versions search exactly the same 
cost-bounded space (to find all optimal solutions), 


$T_{calc}$ on N processors = $T_{calc}$ on 1 processor = $T_1$ 


3. Communication time $T_{comm}$: is the sum of the time spent 
by all processors in communicating with neighboring 
processors, waiting for the arrival of messages, time in 
starvation, etc. For single processor execution, $T_{comm} = 0$. 
Since, at any time, a processor is either communicating or 
computing, 

$T_{comm} + T_{calc} = N \cdot T_N$ 


4. Speedup S: is the ratio $T_1 / T_N$. 

5. Efficiency E: is the speedup divided by N. E denotes the 
effective utilization of computing resources. 


$E = \frac{S}{N} = \frac{1}{1 + \frac{T_{comm}}{T_{calc}}}$ 


6. Unit Computation time $U_{calc}$: is the mean time taken for 
1 node expansion. 


7. Unit Communication time $U_{comm}$: is the mean time taken 
for sending a work request to a processor and receiving 
a response (work or a null message). In the work 
distribution scheme of Section 3, $U_{comm}$ is a fixed constant 
(determined by the speed of the communication). 
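
The relations among these quantities can be checked numerically. The short sketch below computes speedup, efficiency and total communication time from measured run times; the function name is ours, and the parallel run time of 420 seconds in the usage note is a hypothetical value chosen so that the speedup equals the figure of 25 on 128 processors reported in Section 3.

```python
def performance(t1, tn, n):
    """Return (speedup, efficiency, t_comm) from T_1, T_N and N.

    Uses T_calc = T_1 and T_comm + T_calc = N * T_N from the
    definitions above.
    """
    speedup = t1 / tn
    efficiency = speedup / n
    t_comm = n * tn - t1
    return speedup, efficiency, t_comm
```

For example, `performance(10500.0, 420.0, 128)` gives a speedup of 25 and an efficiency just under 0.2, which agrees with $1 / (1 + T_{comm}/T_{calc})$ for the resulting $T_{comm}$.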




Figure 2: Linear Chain of processors 


4.2 Iso-efficiency Functions 


As discussed in the previous section, the efficiency obtained in 
parallel DFS is determined by the number of processors and 
the problem size.² For a given problem size W, increasing the 
number of processors N causes the efficiency to decrease 
because $T_{comm}$ increases while $T_{calc}$ remains the same. For a 
fixed N, increasing W improves efficiency because $T_{calc}$ 
increases and the work distribution scheme with α-splitting does 
not cause a proportionate increase in $T_{comm}$. If N is increased, 
then we can keep the efficiency fixed by increasing W. The rate of 
increase of W with respect to N is dependent upon the architecture 
and the work distribution algorithm. The required rate of growth 
of W w.r.t. N (to keep efficiency fixed) essentially determines 
the scalability of the architecture for the work distribution 
algorithm. For example, if W is required to grow exponentially 
w.r.t. N, then it is difficult to make good use of the 
architecture for a large number of processors. On the other hand, 
if W needs to grow only linearly w.r.t. N, then the work 
distribution scheme is highly suited for the architecture. If W 
needs to grow as f(N) to maintain an efficiency E, then f(N) is 
the iso-efficiency function and the plot of f(N) w.r.t. N is the 
iso-efficiency curve. 

Next we derive the iso-efficiency function of parallel 
cost-bounded DFS for the 1-ring. The analysis for the 2-ring is 
similar and is left out. We present a theoretical model that gives 
us bounds on the total communication time $T_{comm}$ in terms of 
problem size W and number of processors N for different work 
distribution schemes. Predictions from our model seem to closely 
agree with experimental data, hence we feel that the model is 
reliable. 


4.3  Iso-efficiency Analysis of the simple 
work distribution scheme 


Consider the linear chain of N processors of Fig. 2. A 1-ring is a 
linear chain with a fold back from processor N-1 to 0. In a 1-ring 
a processor can get work from its left neighbor and send work 
to its right neighbor. Initially W work is available in processor 
0. In order to achieve good work distribution every processor 
needs to get roughly $W/N$ for itself.³ Suppose that whenever work 
W is split between a donor and a requester, then the requester 
gets at most $\alpha W$ (for some constant α such that 0 < α < 1.0). 
Then 

Maximum piece of work coming into processor 0 is $W$ 
Maximum piece of work coming into processor 1 is $\alpha W$ 
Maximum piece of work coming into processor i is $\alpha^i W$ 

From the above, we can see that in order to get $W/N$ work, 
processor i has to get at least $\frac{1}{N \alpha^i}$ transfers (i > 0). 

Hence the total number of stack transfers 

$\geq \sum_{i=0}^{N-1} \frac{1}{N \alpha^i} = \frac{1}{N} \sum_{i=0}^{N-1} \beta^i$ (where $\beta = \frac{1}{\alpha}$) $= \frac{\beta^N - 1}{\beta - 1} \cdot \frac{1}{N}$ 

$T_{comm} = U_{comm} \cdot \frac{\beta^N - 1}{\beta - 1} \cdot \frac{1}{N}$ (lower bound) 

$T_{calc} = U_{calc} W$ 

Efficiency $= \frac{1}{1 + \frac{T_{comm}}{T_{calc}}} = \frac{1}{1 + \frac{U_{comm}}{U_{calc}} \cdot \frac{\beta^N - 1}{\beta - 1} \cdot \frac{1}{N W}}$ 

For constant efficiency, 

$U_{calc} N W \propto U_{comm} \frac{\beta^N - 1}{\beta - 1}$, i.e., $W = O(\frac{\beta^N}{N})$ 
(since $U_{comm}$ and $U_{calc}$ are constant) 

Thus the iso-efficiency function is exponential.⁴ The 
iso-efficiency function for the 2-ring can be obtained similarly, 
and is also exponential. This explains the poor performance of the 
1-ring and 2-ring. 


² It is also determined by the architecture; e.g., hypercube and 
shared memory architectures provide better efficiencies for parallel 
DFS [10]. In this paper we restrict our discussion to the ring 
architecture only. 

³ This is true only if the efficiency is high. Hence the analysis 
given here is not valid for "low-efficiency" iso-efficiency curves. 
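
The exponential growth can be made tangible with a few lines of code. The sketch below evaluates the lower bound on the number of stack transfers for the 1-ring and the smallest problem size that is compatible with a target efficiency. The function names and the default unit costs of 1.0 are assumptions for illustration; the summation starts at i = 0, which corresponds to the closed form $(\beta^N - 1)/(\beta - 1) \cdot 1/N$.

```python
def min_transfers(n, alpha):
    """Lower bound on stack transfers: (1/N) * sum_{i=0}^{N-1} beta**i."""
    beta = 1.0 / alpha
    return sum(beta ** i for i in range(n)) / n


def min_problem_size(n, alpha, efficiency, u_comm=1.0, u_calc=1.0):
    """Smallest W for which the target efficiency is still attainable.

    From E = 1 / (1 + T_comm / T_calc) with T_calc = u_calc * W:
    W = E / (1 - E) * T_comm / u_calc, using the lower bound on T_comm.
    """
    t_comm = u_comm * min_transfers(n, alpha)
    return efficiency / (1.0 - efficiency) * t_comm / u_calc
```

With α = 0.5 and a target efficiency of 0.5, doubling the ring from 16 to 32 processors multiplies the required problem size by more than 30000 in this model.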


4.4 An Improved Strategy 


In the previous work distribution scheme, we restricted commu- 
nication to occur only between immediate neighbors of the ring. 
The analysis in the previous section clearly indicates a weakness 
due to this: the total count of stack transfers grows exponen- 
tially in a ring of processors because the size of the work pieces 
coming into successive processors decreases geometrically (in 
the ratio $1, \alpha, \alpha^2, \ldots$). To solve this problem, we designed the 
following work distribution scheme. 

In this scheme, we designate a special processor that selects 
the target for each requesting processor. This special processor 
maintains a variable I whose value denotes the next donor pro- 
cessor. Whenever a processor needs work, it sends a message 
to the special processor, which returns the current value of I 
and also increments it. The requesting processor now sends a 
request for work to processor I. 

This work distribution scheme appears to have a lot of 
overhead, since even to decide the identity of the next donor, a 
processor needs to wait for O(N) time. The actual request for work 
and the receipt of a response again take O(N) time. On the other 
hand, 
requesting work and receiving a response takes only a constant 
time in the simple scheme. But, as the analysis of the next 
section shows, the new scheme needs to make far fewer requests 
for work. Therefore it has a substantially better iso-efficiency 
function and speedup performance. 
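
A minimal sketch of the donor-selection step is given below. The class name, the wrap-around of the pointer modulo N, and the rule that a requester is never directed to itself are assumptions added for this example; the text above specifies only that the special processor returns the current value of I and increments it.

```python
class DonorSelector:
    """State kept by the special processor: the pointer I to the next donor."""

    def __init__(self, n):
        self.n = n            # number of processors in the ring (n >= 2)
        self.next_donor = 0   # the variable I of the text

    def select(self, requester):
        """Return the current value of I and advance it modulo N.

        If I happens to point at the requester, the next processor is
        returned instead (an assumption of this sketch).
        """
        donor = self.next_donor
        self.next_donor = (self.next_donor + 1) % self.n
        if donor == requester:
            donor = self.next_donor
            self.next_donor = (self.next_donor + 1) % self.n
        return donor
```

Because successive requests are directed to successive processors, every processor is asked for work about once in every N requests, which is the property used in the analysis of the next section.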


4.5 Iso-efficiency analysis of the Improved 
Scheme 


Let ε be the minimum amount of work transferable. (The 
absolute minimum amount of work transferable is one node 
expansion. If we give out work only from levels that are above 
a CUTOFF depth, then ε can be increased by increasing the 
CUTOFF.) We now present an upper bound on the number of 
stack transfers. 


⁴ Since the value of $T_{comm}$ used in the analysis is only a 
lower bound, the actual iso-efficiency function can be worse than 
exponential. 


Let us assume that in every V(N) requests made for work, 
every processor in the system is requested at least once. Clearly, 
V(N) ≥ N. In general, V(N) depends on the load balancing 
algorithm. Recall that in a transfer, work (w) available in a 
processor is split into two parts ($\alpha w$ and $(1 - \alpha) w$), and one 
part is taken away by the requesting processor. Hence after a 
transfer, neither of the two processors (donor and requester) has 
more than $(1 - \alpha) w$ work (assuming without loss of generality 
that α ≤ 0.5). The process of work transfer continues until work 
available in every processor is less than ε. Initially processor 0 
has W units of work, and all other processors have no work. 

It is easy to see that after $(\log_{\frac{1}{1-\alpha}} \frac{W}{\epsilon}) V(N)$ requests, the 
maximum work available in any processor is less than ε. 
Hence, the total number of transfers $\leq V(N) \log_{\frac{1}{1-\alpha}} \frac{W}{\epsilon}$ 


$T_{comm} = U_{comm} \cdot V(N) \log_{\frac{1}{1-\alpha}} \frac{W}{\epsilon}$ (upper bound) 


$T_{calc} = U_{calc} W$ 


Efficiency $= \frac{1}{1 + \frac{T_{comm}}{T_{calc}}} = \frac{1}{1 + \frac{U_{comm} \cdot V(N) \log_{\frac{1}{1-\alpha}} \frac{W}{\epsilon}}{U_{calc} \cdot W}}$ 

For the improved work distribution scheme, V(N) = N, and 
$U_{comm} = O(N)$. Hence for iso-efficiency, 


$W \sim O(N^2 \log W)$ or $W \sim O(N^2 \log N)$ 


This iso-efficiency function is much better than $O(\beta^N / N)$. We 
implemented the scheme and tested its performance. As shown 
in Fig. 3, the speedups are substantially higher than those of the 
previous scheme. 
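
The bound of this section can be evaluated directly. In the sketch below, `max_transfers` computes the upper bound on the number of transfers and `efficiency_lower_bound` inserts it into the efficiency expression for the improved scheme, with V(N) = N and a unit communication time proportional to N. The function names and the constant `c` (the communication cost per unit of ring distance) are hypothetical parameters introduced only for this example.

```python
import math


def max_transfers(w, eps, alpha, v):
    """Upper bound V(N) * log_{1/(1-alpha)}(W / eps) on the number of transfers."""
    return v * math.log(w / eps) / math.log(1.0 / (1.0 - alpha))


def efficiency_lower_bound(w, n, eps=1.0, alpha=0.5, u_calc=1.0, c=1.0):
    """Efficiency bound for the improved scheme.

    Assumes V(N) = N and U_comm = c * N, where c is a hypothetical
    per-hop communication cost.
    """
    t_comm = (c * n) * max_transfers(w, eps, alpha, v=n)
    return 1.0 / (1.0 + t_comm / (u_calc * w))
```

Under these assumptions the efficiency rises towards 1 as W grows for fixed N, and keeping it constant requires W to grow roughly as $N^2 \log N$.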


4.6 Finkel and Manber’s Scheme 


Finkel and Manber used a different work distribution scheme 
in their implementation of parallel depth-first search [2]. In 


their scheme, each processor maintains a local variable, target, 
to point to a donor processor. Target is incremented (modulo 


N) every time the processor seeks work. We can compute the 
iso-efficiency function of this scheme by following the method 
in Section 4.5. For this scheme, $V(N) = N^2$ in the worst case.⁵ 
But $U_{comm}$ is still O(N). Hence the iso-efficiency function can 
be as bad as $O(N^3 \log N)$. 

The superiority of our improved work-distribution scheme 
over this and the first scheme is clearly seen in the speedup 
curves of Fig. 3. Initially our second scheme is slightly worse 
than the other two schemes due to the extra overhead of 
requesting the value of target before requesting work. But 
for larger numbers of processors, our second scheme makes 
sufficiently fewer requests than the other schemes, and hence gives 
higher speedups. 


5 Related Research. 


Many researchers have implemented parallel DFS on the ring 
architecture and studied its performance for around 16 to 20 
processors. Monien [13] and Wah [14] present parallel depth-first 
search procedures on a ring network. The work distribution 
schemes in these formulations are very similar to the first scheme 
presented in this paper. From the analysis of Section 4.3 (and 
our experiments) it is clear that this work distribution scheme 
is not able to provide good speedup on large rings. 


Figure 3: Average speedup vs Number of processors for 
parallel cost-bounded depth-first search on a ring 
embedded in Intel Hypercube. Sequential Exec. time ~ 10500 
secs, problem size ~ 9 M nodes 

Manber presents an abstract model in [8] that captures the 
distribution of work being done in parallel depth-first search. 
For this model, Manber presents different work distribution 
schemes and computes lower bounds on the amount of inter- 
ference in a shared-memory system. (Part of the analysis pre- 
sented in Section 4.5 uses the same technique that Manber used 
for the analysis of interference). Manber’s analysis served as a 
basis for the design of the parallel depth-first search scheme 
presented by Finkel and Manber in [2]. This scheme has a 
better iso-efficiency function ($O(N^3 \log W)$ in the worst case) for 
the ring than the simple work distribution scheme (see Section 
4.6). But this function is worse than the iso-efficiency function 
($O(N^2 \log W)$) of the improved scheme in Section 4.4. The 
superiority of our improved scheme is clearly seen in the speedup 
curves of Fig. 3. 


6 Conclusions. 


We have presented experimental and analytical evaluation of a 
number of work distribution schemes used in parallel depth-first 
search on the ring architecture. We found that the choice of the 
work distribution algorithm has a significant impact on the 
performance of the parallel depth-first search algorithm. We have 
introduced the concept of iso-efficiency function to characterize 
the effectiveness of different work distribution schemes. Table 
1 shows iso-efficiency functions of parallel depth-first search for 
different work distribution schemes. The development of the 
new work distribution scheme for the ring was motivated by 
the iso-efficiency analysis of the other two work distribution 
schemes. Even though the new scheme appeared to have a lot 
of overhead, it had a better iso-efficiency function, and was 
expected to perform better on larger rings. We were pleased to 
find that the experimental results were in close agreement with 
our theoretical results. 


Load balancing scheme                  Iso-efficiency Function 

Section 2, Wah [14], Monien [13]       $\beta^N / N$ 
Finkel and Manber [2]                  $N^3 \log N$ 
The improved scheme (Section 4.4)      $N^2 \log N$ 

Table 1: Iso-efficiency functions of different 
work-distribution schemes. 


⁵ This result was proved by Manber [8] while analyzing the memory 
interference in shared memory architectures. 

The performance of parallel depth-first search is also greatly 
dependent upon the architecture. Our experimental results and 
iso-efficiency analysis show that the hypercube and 
shared-memory architectures are significantly better than the ring. 
In [10] we also present a work distribution scheme that has 
almost optimal performance on shared-memory/ω-network-with- 
message-combining architectures (such as RP3 [1]). The 
iso-efficiency function has been found useful in characterizing the 
scalability of many other parallel algorithms as well. 


PARALLEL ALGORITHMS FOR ANSWERING THE TAUTOLOGY QUESTION 


Professor Gary D. Hachtel and Peter H. Moceyunas 
Department of Electrical and Computer Engineering 
University of Colorado, Boulder, CO 80309 


Abstract - This paper presents parallel multilevel and 1-level algorithms 
for answering the tautology question for Boolean Networks. These paral- 
lel algorithms recursively bipartition and assign work to available proces- 
sors. The results from the implementation on a network of Sun 3 worksta- 
tions, on an Encore Multi-Max multiprocessor and on an Intel hypercube 
computer are presented. A model for predicting speedups in a general 
divide and conquer setting is developed which includes communication 
delays and start up costs of the host multiprocessor system. This model 
accurately predicts speedups for problems with balanced recursion trees 
and can be used to characterize the parallelization potential of the host 
systems. For the specific context of Boolean tautology checking, this 
model can be used to show that unit parallel efficiency (i.e., linear 
speedup) is indeed possible, but that parallel efficiency is limited in the 
1-level algorithm by tree imbalance and by size. The addition of dynamic 
processor scheduling to overcome tree imbalance is presented, along with 
preliminary results. 
Introduction 


In the computer aided design of VLSI circuits, the designer may 
describe pieces of the digital system as a Boolean function (1-level func- 
tion) or as a set of interconnected Boolean functions (Multi-Level func- 
tion). Applications like test generation, logic minimization and 
equivalence checking, which use as input these Boolean functions, require 
an algorithm to check if these functions are tautologous. Note: 


A Boolean function or network of interconnected Boolean func- 
tions is tautologous if and only if all its outputs are equal to one 
for all possible combinations of its inputs. 


We will call an algorithm which determines whether or not a Boolean 
function is tautologous a tautology algorithm. Single level and 
multilevel tautology algorithms have been previously studied by [4], [6], 
[9], [7], [5] and [8]. These algorithms utilize Shannon cofactoring with 
recursive divide and conquer strategies. Due to the nature of the problem, 
run times are $O(2^q)$, where q is the number of inputs to the Boolean 
function, although heuristics are often effective in reducing run times for 
certain problem classes. 


With the exception of [9], these previous tautology algorithms were 
implemented for serial processing. We have developed from a 1-level 
tautology algorithm [4], and a multilevel algorithm [7], corresponding 
parallel algorithms that use a recursive bipartitioning scheme to divide the 
work among processors. These new parallel tautology algorithms have 
been implemented on a network of SUN 3 workstations, on an Encore 
shared memory multiprocessor and on an Intel hypercube computer. With 
regard to purpose they are similar to that of [9], but differ with regard to 
domain of applicability and in method employed (cf., the conclusions sec- 
tion presented below). 


The paper begins by briefly presenting how parallel computations 
can occur in divide and conquer algorithms and then develops a model for 
expected speedups of the parallel algorithm for a class of divide and con- 
quer algorithms. The serial and parallel tautology algorithms using a 
static processor assignment are presented with further development of the 
speedup model specifically for tautology. This parallel speedup model is 
designed for the problem of Boolean Tautology checking. It is, however, 
generalizable with minor variations, to any of the following divide and 
conquer applications: nested dissection for linear algebraic equations, 
binary sorting, shortest path calculations, and VLSI placement. The 
results of the experiments run on the three computer systems are then 
given along with comparisons to the speedup models developed for both 
multilevel and 1-level tautology algorithms. Further improvements in run 
time for the parallel algorithms through dynamic processor scheduling are 
discussed followed by some conclusions and further work. 


Divide and Conquer Algorithms 


Given a particular problem, a recursive divide and conquer algo- 
rithm will generate a set of recursive calls. This set of recursive calls can 
be represented in a tree data structure, where a node in the tree represents 
a recursive call to the routine. We will assume that the divide and conquer 
algorithm is a binary recursive algorithm. 


The manner in which a problem is divided and the size of the result- 
ing smaller problems depends on the nature of the particular divide and 
conquer algorithm. Algorithms which can divide problems into equal 
pieces will always produce a full recursion tree. Other algorithms, like 
tautology, attempt to create two equally sized problems but cannot 
guarantee it and thus unbalanced recursion trees may occur. 


Concurrency in Divide and Conquer 


An underlying implication in a divide and conquer algorithm is that 
the computational problem at each node of the recursion tree is indepen- 
dent of the computations at all other nodes on the same level in the tree, 
thus the work done in each node could be performed in parallel. If an 
unlimited number of processors were available, all nodes on each level of 
recursion could be processed concurrently. It is clear that in the worst 
case an exponential number of processors is required. 


Parallel processing is still possible with a limited number of 
processors. Let N, the number of processors, be a power of 2. For each of 
the first $\log_2 N$ levels of recursion there are enough processors to 
process in parallel all nodes of a level. For each level 0 through 
$\log_2 N$, assign a processor to each node on the level. Each processor 
assigned to perform the work of a node at level $\log_2 N$ is also assigned 
the work of all the node's progeny. Figure 1 shows a recursion tree and 
the processor assignment for a four processor system. Here, we assign a 
processor to each subproblem generated at level $\log_2 N$ in the recursion 
tree and let each processor find its solution through serial computation. 
Note that processor assignment is fixed relative to the recursion tree. 
Later, in the results section, we will discuss the use of dynamic 
processor scheduling. 
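
The static assignment can be written down compactly. In the sketch below a node is identified by its level and its left-to-right index on that level; the numbering of processors (the parent keeps its left child and hands the right child to the processor halfway across its block) is an assumption of this example and may differ from the numbering used in Figure 1.

```python
def assigned_processor(level, index, n):
    """Processor that handles node `index` (0-based, left to right) on `level`.

    Assumes n is a power of 2 and a full binary recursion tree. On levels
    0..log2(n) the processors are spread evenly over the nodes; deeper
    nodes inherit the processor of their ancestor on level log2(n).
    """
    depth = n.bit_length() - 1           # log2(n) for a power of 2
    if level > depth:
        index >>= (level - depth)        # index of the ancestor on level log2(n)
        level = depth
    return index * (n >> level)
```

For four processors, the four nodes on level 2 are handled by processors 0 to 3, and every deeper node is handled by the processor of its level 2 ancestor.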


Clearly this strategy will have low parallel efficiency if the 
maximum level of recursion is not greater than $\log_2 N$. It is also evident that 
as the maximum level of recursion (problem size) increases, the parallel 
efficiency approaches 100 percent for problems with balanced recursion 
trees. The details of how the parallel efficiency depends on the problem 
size are not immediately evident. This is the motivation for developing a 
model to predict speedup as a function of maximum recursion depth. 
With such a model, we can get some insight into how speedups will 
increase with problem size and determine the potential of this algorithm. 
Furthermore, parallel processing overheads, such as communications and 
startup costs, can be included in the model which can characterize the dif- 
ferent computer environments. 


Speedup Model for Divide and Conquer Algorithms 


Our development of the speedup model begins with the following 
observation: Not all divide and conquer algorithms produce a recursion 
tree with a predictable structure. This unpredictable tree structure inhibits 
the development of accurate timing models for all possible data inputs to 
such an algorithm [3]. However, an accurate model can be made if we 
restrict ourselves to those algorithms or sets of inputs to an algorithm 
which produce recursion trees with some predictable structure, i.e., which 
satisfy suitable constraints. In order to develop our speedup model, the 
following assumptions are made about the divide and conquer algorithm 
and the structure of its recursion tree: 


1) The algorithm produces a full binary recursion tree for any input 
data. 


2) All nodes on a particular level of the tree will perform the same 
amount of work. 


3) The amount of work performed by a node in the recursion tree is 
$\eta + \gamma$, where $\eta$ is equal to $k/2$ times the $\eta$ work done by its parent 
node ($0 < k < 2$) and $\gamma$ is some constant for the particular problem. 


In order to develop an accurate timing model, we assume 
communications and process start up costs (i.e., delays) to be non-zero. 
Clearly, communications and start up costs are a function of the type of 
computer system being used. Let α be a parameter representing the way in 
which the characteristics of a computer system influence these costs. In 
general both costs depend on the characteristics of the particular divide 
and conquer algorithm employed. They also depend on the level of 
recursion. Let A be a parameter embodying the properties of the particular 
algorithm and denote the particular recursion level in the algorithm by i. 
Then the communications and start up costs for a level i node in the 
recursion tree can be modeled by the functions $C(\alpha, A, i)$ and 
$I(\alpha, A, i)$ respectively. For any given computer system, the 
communications delay between different pairs of processors can be 
different. Therefore the communications costs are also a function of the 
different processor pairs. It may also be possible that start up cost is 
influenced by the identity of the processor starting the process and the 
processor which the new process is being started on. Therefore let 
$C_{m,n}(\alpha, A, i)$ represent the communications costs for processor m 
to transmit data to processor n on level i+1 in the recursion tree. 
Similarly, let $I_{m,n}(\alpha, A, i)$ represent the start up cost for 
processor m to start a new process on processor n on level i+1 in the 
recursion tree. 


We also define a quantity $\Delta t(\alpha, A)$ which is an initialization cost. In 
the general case this parameter can be both a function of the particular 
algorithm and of the computer system. This cost represents the serial 
work a program may have to perform prior to the execution of the 
recursive divide and conquer part of the program. 


Before developing the timing models, we formally define speedup. 
The speedup due to N processors, $S_N$, is 


$S_N = \frac{t_1}{t_N}$, (1) 


where $t_1$ is the run time for a serial divide and conquer algorithm and 
$t_N$ is the run time for an N processor parallel divide and conquer 
algorithm. In order to obtain a speedup model, models for the serial and 
parallel run times must be developed. 


Serial Run Time Model 


Let l be the maximum number of levels of recursion in the tree. We 
define $\eta_0 + \gamma$ to be the amount of work performed at the level 0 node in 
the recursion tree (assumption 3). Following the algorithmic constraints 
given above, the amount of work done by any node on a level can be 
found in terms of $\eta_0$ and $\gamma$ by recursively applying assumption 3. The 
amount of work at any node on level i is $\eta_0 (\frac{k}{2})^i + \gamma$. There are $2^i$ nodes 
at a level i and l levels in the tree, thus the total amount of work done or 
the serial run time is 


$t_1 = \sum_{i=0}^{l} [\eta_0 (\frac{k}{2})^i + \gamma] 2^i$. 


This can be further simplified to 


$t_1 = \eta_0 \frac{k^{l+1} - 1}{k - 1} + \gamma [2^{l+1} - 1]$. 


Finally, if there is some initial cost associated with starting the serial 
program on this computer system, then an additional cost $\Delta t(\alpha, A_1)$ must 
be added to the serial run time, where $A_1$ represents the parameter 
embodying the properties of the serial algorithm. The serial run time 
becomes 


$t_1 = \eta_0 \frac{k^{l+1} - 1}{k - 1} + \gamma [2^{l+1} - 1] + \Delta t(\alpha, A_1)$. (2) 
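
As a check on the serial model, the following sketch evaluates the run time both by summing the work of every node level by level and by the closed form $\eta_0 (k^{l+1} - 1)/(k - 1) + \gamma (2^{l+1} - 1)$, as we read the derivation above. The function names and the parameter `dt`, which stands for the initialization cost, are chosen for this example only.

```python
def serial_time_sum(eta0, gamma, k, l, dt=0.0):
    """Serial run time by summing the work of all 2**i nodes on levels 0..l."""
    return sum((eta0 * (k / 2) ** i + gamma) * 2 ** i for i in range(l + 1)) + dt


def serial_time_closed(eta0, gamma, k, l, dt=0.0):
    """Closed form of the same sum (requires k != 1)."""
    return eta0 * (k ** (l + 1) - 1) / (k - 1) + gamma * (2 ** (l + 1) - 1) + dt
```

For $\eta_0 = \gamma = 1$, k = 1.5 and l = 3 both functions give 23.125.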


Parallel Run Time Model 


The parallel run time model is developed as follows. Assume, as in 
the serial run time model, l is the maximum level of recursion and 
$\eta_0 + \gamma$ is the amount of work done at the level 0 node in the recursion 
tree. Let the number of processors be N, such that $N = 2^j$ for some 
$j \in \{1, 2, \ldots\}$. The N processor run time can be determined by first 
calculating the run time for those levels in which the nodes are 
outnumbered by the processors, and then determining the run time while 
all N processors are being used. 


Let $\mu = \min(l, \log_2 N)$. During the first μ-1 levels in the recursion 
tree the work to be counted to obtain the run time for the first μ-1 
levels is the sum of the work done by nodes along any path from the root 
to a node on level μ-1, including the last node. This sum is explicitly 


$\sum_{i=0}^{\mu-1} [\eta_0 (\frac{k}{2})^i + \gamma]$. 


This simplifies to 


$\eta_0 \frac{(k/2)^\mu - 1}{(k/2) - 1} + \mu \gamma$. (3) 


The parallel work done in the remaining levels of the recursion tree, 
μ through l, is the sum of the work done to finish a μ level node and all 
its progeny. This is similar to the sum found for the serial algorithm. 


$\sum_{i=\mu}^{l} [\eta_0 (\frac{k}{2})^i + \gamma] 2^{i-\mu}$ 


simplifies to 


$\frac{\eta_0}{2^\mu} \sum_{i=\mu}^{l} k^i + \gamma [2^{l-\mu+1} - 1]$. (4) 


As in the computation of the serial time, a $\Delta t(\alpha, A_N)$ term is 
required to include any initial start up costs for the N processor program. 
Combining equations (3) and (4) and $\Delta t(\alpha, A_N)$ produces the concurrent 
run time $t_N$, which is 


$t_N = \eta_0 \frac{(k/2)^\mu - 1}{(k/2) - 1} + \mu \gamma + \frac{\eta_0}{2^\mu} \sum_{i=\mu}^{l} k^i + \gamma [2^{l-\mu+1} - 1] + \Delta t(\alpha, A_N)$. (5) 


This model for the parallel algorithm does not include communications 
and process start up costs. To include these costs we observe that each 
node in the recursion tree from the root to level μ-1, where 
$\mu = \min(l, \log_2 N)$, will have a start up and communications cost 
associated with it. In the model for $t_N$ developed above (5), it was 
possible to assume that processing at each level of the recursion tree 
finished at the same time because of assumptions 1 and 2. If either the 
communications or start up costs are not equivalent for different pairs of 
processors, then this assumption becomes invalid. After level μ-1, no more 
new processes are started and the time for each processor to complete its 
work is equivalent. However, processors may not begin processing the 
level μ nodes at the same time. The last processor to reach a level μ node 
will be the last processor to finish processing. Therefore the overall run 
time is governed by this last processor. The sum of the communications 
and start up costs along a path from the root to a level μ node is the 
extra time required to reach that level μ node. The path with the largest 
sum leads to the last node to begin processing at the μ level. Given a node 
z on level μ-1, the binary recursion tree follows a unique path L from the 
root node to z. Let Z be the set of all such paths for a particular set of 
μ-1 level nodes. Each node in a particular path L is on a recursion level i 
and has a processor m assigned to it. A new process on some processor n 
will be started by processor m. Let p(L) be the set of all triples (m,n,i) 
corresponding to nodes on the path L. Therefore the delay due to 
communications and start up for an N processor algorithm is 


$\max_{L \in Z} \sum_{(m,n,i) \in p(L)} [C_{m,n}(\alpha, A_N, i) + I_{m,n}(\alpha, A_N, i)]$. (6) 


Combining equations (5) and (6) the general model for the concurrent 
run time now becomes 


$t_N = \eta_0 \frac{(k/2)^\mu - 1}{(k/2) - 1} + \mu \gamma + \frac{\eta_0}{2^\mu} \sum_{i=\mu}^{l} k^i + \gamma [2^{l-\mu+1} - 1] + \Delta t(\alpha, A_N) + \max_{L \in Z} \sum_{(m,n,i) \in p(L)} [C_{m,n}(\alpha, A_N, i) + I_{m,n}(\alpha, A_N, i)]$. (7) 


Model for Speedup 


The speedup now can be formulated by substituting the equations 
(2) and (7) for $t_1$ and $t_N$ in equation (1). This yields 


$S_N = \frac{\eta_0 \frac{k^{l+1} - 1}{k - 1} + \gamma [2^{l+1} - 1] + \Delta t(\alpha, A_1)}{\eta_0 \frac{(k/2)^\mu - 1}{(k/2) - 1} + \mu \gamma + \frac{\eta_0}{2^\mu} \sum_{i=\mu}^{l} k^i + \gamma [2^{l-\mu+1} - 1] + \Delta t(\alpha, A_N) + \max_{L \in Z} \sum_{(m,n,i) \in p(L)} [C_{m,n}(\alpha, A_N, i) + I_{m,n}(\alpha, A_N, i)]}$ 


Thus a general model for speedup has been developed. It can be easily 
specialized to a specific algorithm and computer system. This model is a 
function of $A_1$, $A_N$, k, N, $\eta_0$, $\gamma$, α, l and initialization, startup and 
communications costs. After the tautology algorithm has been presented, 
further simplifications will be made to the model which are due to 
specific properties of the algorithm. 


The Tautology Algorithms 
In this section we will present both a serial and a parallel tautology 
algorithm in coarse detail. The algorithms for multilevel functions and 1- 
level functions are very similar. Since the 1-level algorithm is a special 
case of the multilevel algorithm, we will present the multilevel algorithm
with notes on where the 1-level version differs.


Serial Tautology 


The multilevel tautology algorithm described in [7] is a recursive
divide and conquer algorithm. A high level outline of the algorithm is
shown in figure 2. The term cover is defined to be the description of 
either a 1-level or multilevel Boolean function. A brief description of 
each routine will be given here. For a more complete description of the 
multilevel and 1-level algorithms see [7] and [4] respectively. 


The routine SPECIAL_CASES corresponds to the termination step
of any divide and conquer algorithm [11]. The outputs of cover F are
scanned: if any are set to 0, a 0 is returned; if all are set to 1, a 1 is
returned. Otherwise a -1 is returned. In the 1-level algorithm,
other attributes of the 1-level cover are checked which can give an
immediate answer to the tautology question. The 1-level algorithm then
calls UNATE_REDUCTION [4] (not shown) if SPECIAL_CASES returns a
-1. This routine checks the cover for special unate properties which, if
they exist, allow the recursion tree to be trimmed. If an answer cannot be
determined by SPECIAL_CASES (or by UNATE_REDUCTION in the
1-level case), then an input variable x_j is heuristically selected by the
routine SELECT_SPLIT. Due to the nature of the tautology problem, an
optimal choice of the splitting variable is not guaranteed by SELECT_SPLIT.
Next the routine COFACTOR is called twice, which creates two new covers
by cofactoring F with respect to x_j and x̄_j. In the multilevel algorithm,
the cofactoring procedure copies the cover F to create new multilevel
covers F_{x_j} and F_{x̄_j} and asserts the primary input x_j to a 1 or 0.
The SIMULATE routine then propagates the value towards the primary
outputs through logic simulation. In the 1-level algorithm, this step is not
necessary because the input functions are only one level.
ML_TAUTOLOGY is recursively called using the two simplified functions.
If either returns 0, then the original function F is not tautologous.
Otherwise F is tautologous.
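
The recursion described above can be sketched in Python for the simpler
1-level case. This is a minimal sketch, not the implementation of [7] or
[4]: it assumes a single-output sum-of-products cover stored as a list of
cubes (each cube a dict mapping a variable index to its required value), it
omits UNATE_REDUCTION and SIMULATE, and the split heuristic (most
frequently used variable) is a stand-in chosen for illustration.

```python
from collections import Counter


def special_cases(cover):
    """Return 0 or 1 when the answer is immediate, otherwise -1."""
    if not cover:
        return 0                      # no cubes: the function is constant 0
    if any(len(cube) == 0 for cube in cover):
        return 1                      # a cube with no literals covers everything
    return -1


def select_split(cover):
    """Heuristic split choice (assumed here): the most frequently used variable."""
    counts = Counter(var for cube in cover for var in cube)
    return counts.most_common(1)[0][0]


def cofactor(cover, var, value):
    """Cofactor the cover with respect to var = value."""
    return [
        {v: b for v, b in cube.items() if v != var}
        for cube in cover
        if cube.get(var, value) == value   # drop cubes that require the other value
    ]


def tautology(cover):
    r = special_cases(cover)
    if r != -1:
        return r
    j = select_split(cover)
    if tautology(cofactor(cover, j, 1)) == 0:
        return 0
    if tautology(cofactor(cover, j, 0)) == 0:
        return 0
    return 1


# x0 + x0'x1 + x0'x1' is a tautology; x0 + x1 is not.
is_taut = tautology([{0: 1}, {0: 0, 1: 1}, {0: 0, 1: 0}])
not_taut = tautology([{0: 1}, {1: 1}])
```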


The major differences between the 1-level algorithm and the multilevel
algorithm are that the simulation step is not required for the 1-level
algorithm and that the 1-level algorithm uses a unate filter. We will refer
to the work done by the SIMULATE routine as simulation work and the work
done by SPECIAL_CASES, SELECT_SPLIT and COFACTOR as
simplification work.


The Parallel Tautology Algorithm 


The parallel tautology algorithm is shown in figure 3. This algorithm
is similar to the serial tautology algorithm up to the last call to
COFACTOR. After the last call to COFACTOR, the algorithm branches
to follow either a parallel processing path or a serial processing path,
based on the number of processors and the current level of recursion. If
the number of processors is greater than or equal to the maximum number
of nodes in the tree at the next level of recursion, then the parallel
processing path is taken. Otherwise, the parallel algorithm takes the serial
processing path and behaves exactly like the serial algorithm. In the
parallel processing path, a new process is started on another processor.
Let m be a label for the current process and n be a label for the new
process on another machine. Next, process m sends the cover F_{x̄_j} to
process n. Process m calls SIMULATE and then ML_TAUTOLOGY with
F_{x_j} as the input cover, and similarly process n calls SIMULATE and
ML_TAUTOLOGY with F_{x̄_j} as the input cover. As in the serial tautology
algorithm, if either function is found not to be a tautology, then the
cover F is not a tautology. If both are found to be tautologies, then F
is a tautology. Note that the answer produced by process n must be sent
back to its parent process m and in turn process m must wait for the
answer from process n. The variable il indicates at what level in the
recursion tree a processor began processing. If the current level of
recursion is equal to il, then the result from tautology should
be sent back to its parent process. Otherwise, the answer should simply
be returned. The parameter parent contains the information regarding
the identity of a process’s parent. Each process must know this
information in order to send the tautology answer back to its parent.
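
The branch between the parallel and serial paths can be sketched as
follows. This is an illustrative sketch only: Python threads from
`concurrent.futures` stand in for processes on other processors, the
future returned by `submit` replaces the explicit parent and il
bookkeeping, the cover format (list of cubes, each a dict from variable
index to required value) and the split choice are assumptions, and the
SIMULATE step is omitted as in the 1-level case.

```python
from concurrent.futures import ThreadPoolExecutor


def special_cases(cover):
    if not cover:
        return 0
    if any(len(cube) == 0 for cube in cover):
        return 1
    return -1


def cofactor(cover, var, value):
    return [
        {v: b for v, b in cube.items() if v != var}
        for cube in cover
        if cube.get(var, value) == value
    ]


def par_tautology(cover, i, n_procs, pool):
    r = special_cases(cover)
    if r != -1:
        return r
    j = min(v for cube in cover for v in cube)      # simple stand-in split choice
    pos, neg = cofactor(cover, j, 1), cofactor(cover, j, 0)

    # Parallel path: enough processors for every node on the next level.
    if 2 ** (i + 1) <= n_procs:
        child = pool.submit(par_tautology, neg, i + 1, n_procs, pool)
        mine = par_tautology(pos, i + 1, n_procs, pool)
        # Wait for the child's answer only if our own half is a tautology.
        return 1 if mine == 1 and child.result() == 1 else 0

    # Serial path: behaves exactly like the serial algorithm.
    if par_tautology(pos, i + 1, n_procs, pool) == 0:
        return 0
    if par_tautology(neg, i + 1, n_procs, pool) == 0:
        return 0
    return 1


def run(cover, n_procs):
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        return par_tautology(cover, 0, n_procs, pool)


all_minterms = [{0: a, 1: b, 2: c} for a in (0, 1) for b in (0, 1) for c in (0, 1)]
answer = run(all_minterms, 4)
```

At most N − 1 tasks are submitted for N workers, so a task that blocks on
its child never starves the pool in this sketch.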


Speedup Model for Tautology 


Using the characteristics of the multilevel tautology algorithm and
of our computer systems, the speedup model developed previously (8) can
be simplified.


The work done by a node in the recursion tree for multilevel tautology
consists of simulation and simplification work. The simplification
work at the level 0 node for problems which result in full recursion trees
is proportional to j 2^j, where j is the number of inputs in the cover.
Further j is equal to l, the maximum level of recursion. The
simplification work decreases at each level of recursion, thus it
corresponds to the n term in assumption 3. We let


\[ n_0 = l\, 2^{l}. \tag{9} \]


The simulation work can be roughly modeled as the number of functions
in the Boolean network, f, times a computer dependent factor ω(α).
Again for a set of problems which produce full recursion trees, f is
proportional to l, thus


\[ \gamma = \omega(\alpha)\, l. \tag{10} \]


Also, since the SELECT_SPLIT routine attempts to choose a variable
which, after cofactoring and simplification, produces two networks equal
in size, the simplification work of a child node is 1/2 of that done by its
parent. Thus we will let


\[ k = 1. \tag{11} \]
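
As a rough numerical illustration of these choices, the Python sketch
below totals the work in a full recursion tree. It is an assumption-laden
reading of the text rather than the paper's equation (2): it assumes a
level i node does n_0 / 2^i units of simplification work (the halving
stated above), that every node does γ units of simulation work, and that a
full tree has 2^i nodes on level i. The function name is hypothetical.

```python
def serial_work(l: int, omega: float) -> tuple[float, float]:
    """Total (simplification, simulation) work over a full tree of depth l."""
    n0 = l * 2 ** l            # equation (9)
    gamma = omega * l          # equation (10)
    # 2**i nodes on level i, each doing n0 / 2**i simplification work
    simplification = sum(2 ** i * (n0 / 2 ** i) for i in range(l + 1))
    # every one of the 2**(l + 1) - 1 nodes does gamma simulation work
    simulation = gamma * (2 ** (l + 1) - 1)
    return simplification, simulation


work = serial_work(3, 0.5)
```

Under these assumptions each level contributes the same simplification
work, n_0, while the simulation work doubles with each level.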

Approximate expressions for c_mn(α,i,A_N), I_mn(α,i,A_N), Δt(α,A_1)
and Δt(α,A_N) are now given. For the computer systems involved, the
start up cost is assumed to be independent of the nature of the divide and
conquer algorithm A_N, the recursion level i, and the identities of the pair
of processors involved, (m,n). Thus we assume that start up cost depends
only on α, the nature of the particular computer system being used.
Therefore I_mn(α,i,A_N) is modeled as


\[ I_{mn}(\alpha, i, A_N) = s(\alpha), \tag{12} \]


for all m, n, i and A_N. The parameter s(α) is the computer dependent
scaling factor for startup costs. The distance between any pair of Sun 3
workstations is approximately equal to the distance between any other
pair. Also, the processors were selected on the hypercube such that
communications occurred only between nearest neighbor processors.
Therefore we assume that the communications cost between any pair (m,n) is
independent of m and n, but proportional to the amount of data
transmitted. The amount of data transmitted in the tautology algorithm is
proportional to the amount of simplification work done by the node
receiving the data, which is halved at each level of recursion. Thus the
communications cost can be stated as
k i+1 
Crea (A An) =C (%)No pa (13) 
for any level i−1 processor m which is transmitting data to a level i
processor n. The parameter c(α) is a computer system dependent scaling
factor for communications costs. The parameters Δt(α,A_N) and Δt(α,A_1)
are assumed to be constants which are only a function of the particular
computer system. They are not a function of the number of processors
involved and therefore we let


\[ \Delta t(\alpha) = \Delta t(\alpha, A_N) = \Delta t(\alpha, A_1), \tag{14} \]


where Δt(α) is the computer dependent initialization scale factor. Now
substituting (9-14) into (8) and simplifying sums we get


col (2!+1—1 +-At (a) 
l+14+ TH 
for l < log_2 N. The speedup model has been reduced to a function of l, N,
c(α), s(α), ω(α), and Δt(α). For a particular computer system, N, c(α),
s(α), ω(α) and Δt(α) will be fixed and the multilevel speedup model will
be only a function of l. Figure 4 shows a plot of speedup as predicted
from the model versus l, the level of recursion, for N = 8. The plot
contains a family of curves for different values of the startup cost scaling
factor s(α) given specific values for the communications and initialization
scaling factors c(α) and Δt(α) and ω(α) = 0. Several observations can be
made from this plot. Note that as s(α) grows larger, a larger problem
(greater level of recursion) is required to obtain a speedup greater than
1.0. As the problem size grows, the effect of the start up cost becomes
negligible and the speedups will be practically the same as for s(α) = 0.
Other plots using larger values of c(α), not shown here, produce curves
with similar shapes but with decreased speedups, as expected. In a similar
manner it was found that the speedup model predicts that the initialization
cost Δt(α) affects only those problems which have relatively few levels of
recursion. As Δt(α) is increased the speedup moves closer to 1.0 for small
values of l. As either l or s(α) is increased the effect of Δt(α)
becomes less significant.


(15b) 


To obtain the speedup model for the 1-level algorithm, we set ω(α)
to zero, using the fact that there is no simulation work performed by that
algorithm.

Figure 5 shows a plot of speedup versus maximum level of recursion
for N = 8, c(α) = 0.55, s(α) = 8000 and Δt(α) = 100. The plot contains
several curves, each with a different value for ω(α). As ω(α) becomes
larger, the predicted speedups for all problem sizes increase. Thus the
multilevel algorithm is expected to see higher speedups than the 1-level
algorithm. In fact, if ω(α) is large enough, linear speedups would be
obtainable even for small problems.

The Computer Systems 


The tautology algorithms were implemented on a network of eight
Sun 3 workstations, a twenty-processor Encore Multi-Max and a 32-node
Intel hypercube. The implementations on the Multi-Max and hypercube
used the mechanisms for parallel computation provided by Encore
and Intel. Two software packages developed at the University of
Colorado, Boulder, DPUP [1] and GRAIL [10], were used in the
implementation on the Sun 3 network. The packages provide routines
which perform the necessary tasks for distributed processing. The 1-level
and multilevel versions were initially implemented using DPUP. The
dynamic scheduling versions of tautology used GRAIL.


Initial Results


This section presents a range of topics related to the data collected
from the 3 computer systems. A description of the measured data and of
the test input covers is given before the actual results are presented.


Definitions 


Tests were done to measure the real time required by the
serial and concurrent tautology programs to produce an answer. On the
Sun 3 network and on the Encore Multi-Max the run times include all
system and communications time which result from processes being started
and from data transfer. The run times from the Intel hypercube include all
system and communications times from data transfer but not process start
up times. The time to input the description of the cover was not included
in the run time on any computer system. Each set of input data for
tautology was run several times through the programs and the best
(shortest) times were used to calculate the speedup numbers. Program run
times were measured on all computer systems at times when the systems had
a low probability of usage by other users. If other users were observed,
then those measurements were discarded.
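
The measurement rule above amounts to a small calculation, sketched here
in Python. The function name and the sample run times are hypothetical and
are not data from the paper; the only rule taken from the text is that the
shortest serial and shortest concurrent run times are compared.

```python
def best_time_speedup(serial_runs, concurrent_runs):
    """Speedup computed from the shortest run time of each program."""
    if not serial_runs or not concurrent_runs:
        raise ValueError("need at least one run time for each program")
    return min(serial_runs) / min(concurrent_runs)


# Hypothetical run times in seconds for one input cover.
speedup = best_time_speedup([120.0, 118.5, 125.2], [31.0, 29.625, 40.1])
```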


Test Data 


The example 1-level covers in [4] were used for testing 1-level
tautology. Multilevel covers generated by BOLD [2] (using the 1-level
covers as input) were used for testing multilevel tautology. Also, special
tautologous 1-level covers were generated which have the following
properties:


1) A full binary recursion tree is produced when run through tautology. 


2) Whenever a recursion occurs the cover is split into two equally sized 
covers which are one-half the size of the original. 


3) The depth of the recursion tree is the worst case, i.e., the maximum 
possible for the number of inputs. 


The first and second properties fulfill the constraints of the speedup model
developed previously. The final property indicates that for a particular
number of inputs, the tautology algorithm will be required to do the worst
case (i.e. the maximum) amount of work. A set of similar multilevel covers
was generated which have properties 1 and 3. Covers with these
properties are important because they match the model for speedup
described previously, which allows the empirical determination of the
communications and startup parameters. The 1-level covers which
have these properties are called covers of all minterms. The multilevel
covers with these properties are created from the XNOR of the outputs of
two n-bit adder circuits and will be referred to as the adder circuits.


Results from 1-level Tautology 


Variations in the run times for a given input (cover) were observed,
especially on the Sun 3 network. Communications collisions are the
probable cause; they are more noticeable on the Sun 3 network
because its communications are much slower relative to its processor speed
than those of the Intel hypercube or Encore Multi-Max systems.
Further, larger variances in run times are observed between the results
from tests using 4 and 8 processors on the Sun network. This again is
most likely due to the increase in the number of processors, which causes
more messages, more message collisions and thus a larger variance from
run to run.


The maximum level and the total number of leaves in the recursion
tree were recorded for each cover. Figure 6 contains plots of the speedup
model for 1-level tautology (ω(α) = 0) developed previously and the
speedups obtained with the covers of all minterms for the Sun 3 network,
Intel hypercube and Encore Multi-Max (labeled Sun 3, Hypercube, MultiMax
respectively). The Sun 3 network and Encore results were collected using
8 processors and the Intel results using 32 processors. Values for c(α),
s(α) and Δt(α) that best matched the data were chosen (table 1). The
model corresponds very well with the special covers of all minterms. The
models using the same values for the 3 parameters but for different values
of N also matched very well, but are not shown here. Figure 7 shows a
plot of the speedup model and the speedups from the other covers from
the Sun 3 network using 8 processors. Plots for the Encore and
Intel systems are very similar and are omitted.


As one would expect, the model applied to the three computer systems
produced very different values for the scaling factors c(α), s(α) and
Δt(α) (see table 1). The start up cost on the Sun 3 network is expected to
be much larger due to the communications between the user program and
the DPUP system [11]. The Encore computer system uses a fork call and
has much faster communications, so it should have a less costly start up.
The run times from the Intel system did not include processor start up
times and therefore s(α) = 0. The communications scaling factor c(α) is
not nearly as different, but again is higher on the Sun 3 network as would
be expected. These differences in the startup and communications scaling
factors account for the better speedups achieved by the Encore computer
versus the Sun 3 network.


Those covers which generate points below and to the right of the
model’s curve in figure 7 produced recursion trees with fewer than 2^l leaf
nodes, where l is the maximum level of recursion. This indicates that the
recursion tree was not full and therefore these covers do not satisfy the
constraints of the model. The accuracy of the model can be increased
across all problems by introducing the concept of the "effective level of
recursion". A plot of speedup versus the effective level of recursion is
shown in figure 8 for each of the 3 computer systems using the maximum
number of processors available. The effective level of recursion is
defined as log_2 R, where R is the number of leaf nodes in the recursion
tree. The model’s curve is also shown in these plots. Since the model
assumes a full and balanced tree, the effective level of recursion is equal
to the maximum level of recursion assumed by the model, and thus the
model’s curve does not change. Notice that the data points have moved
much closer to the model’s curve. By using the effective level of
recursion we are attempting to model an unbalanced recursion tree by a
balanced tree with a maximum level of recursion log_2 R, which is less than
the maximum level of recursion for the unbalanced tree. It appears that
this approximation works quite well.
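
The definition of the effective level of recursion can be written as a
one-line calculation. The Python sketch below uses hypothetical leaf
counts chosen for illustration; only the definition log_2 R comes from the
text.

```python
import math


def effective_level(leaf_count: int) -> float:
    """Effective level of recursion: log2 of the number of leaf nodes R."""
    if leaf_count < 1:
        raise ValueError("a recursion tree has at least one leaf")
    return math.log2(leaf_count)


# A full tree of depth 8 has 2**8 leaves, so its effective level is its depth.
full = effective_level(2 ** 8)
# A hypothetical unbalanced tree of depth 10 with only 100 leaves behaves
# like a much shallower balanced tree.
sparse = effective_level(100)
```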


In several cases a significant increase in speedup was not observed
when the number of processors was increased for a given input cover.
The speedup model reveals that this behavior is
to be expected for those problems which recur only a few levels. Figure 9
contains a family of speedup model curves resulting
from using different values of N and the same values for c(α), s(α) and
Δt(α) found for the Sun 3 network. There is a region where the models
for different N are very close together, and if the tautology
algorithm achieves a maximum level of recursion in this region for a
given input, no appreciable improvements in speedup will occur even if
the number of processors is increased. The plots in figure 8 show that
nearly all the data points are within this saturation region for the computer
systems. The Encore Multi-Max and the Intel hypercube (plots not
shown) have smaller saturation regions than the Sun 3 network. This is
due to the smaller (zero) start up scaling factor s(α) on the Encore
Multi-Max (Intel hypercube). The smaller the saturation region, the larger
the set of problems which will see speedup improvements as the number of
processors is increased.


Results from Multi-level Tautology 


The multilevel algorithm has only been implemented on the Encore
Multi-Max and on the Sun 3 workstations. The results obtained from the
parallel multilevel tautology on the Encore are very similar to those seen
from the 1-level tautology program. The speedup model constants were
chosen to give the best fit to results from the adder circuits described
above. Table 2 shows the values of the computer dependent speedup
model parameters chosen. Figure 10 shows plots of speedup versus the
maximum level of recursion on the Encore. Each curve represents the
speedup model’s prediction for 2, 4 or 8 processors. The data points
near each curve are the actual speedups obtained for the respective number
of processors using the adder circuits described above. The speedup
model matches the data very well. The value of s(α) for the multilevel
algorithm is much larger than the value for the 1-level algorithm. This
suggests that, at least on the Encore machine, startup cost is not
independent of the algorithm. The mechanism used to create a new
process is the "fork". Since a fork must physically copy the program from
one processor’s memory to another, it is clear that the time of a fork is
dependent on program size. The multilevel program was 2.6 times larger
than the 1-level program; therefore it is understandable that the startup
time would be larger. Figure 11 shows a plot of the speedups from other
examples run through the multilevel tautology algorithm using 8
processors versus maximum level of recursion. Most speedups are observed
to be below the speedup model curve. All of the circuits which
produced speedups less than what the model predicted were problems
which did not produce full recursion trees. If, however, a plot of speedup
versus effective level of recursion is made, there is a shift to the left of
data points towards the speedup model’s curve. Figure 12 is a plot of
speedup versus effective level of recursion from data obtained using 8
processors and the values of the speedup model constants used in the
previous plot. As in the 1-level case the speedup model curve remains the
same since it assumes a full recursion tree, but the data points are now
closer to the curve.


The speedups obtained from the multilevel problems appear to be
higher than those from similar 1-level problems. A cover of all minterms
problem which recurs to the same level as an adder circuit does not see as
great a speedup as the adder circuit does, even though they both produce a
full recursion tree. This better performance by the multilevel algorithm is
due to the extra simulation work done at each node in the recursion tree.
The speedup model for the multilevel algorithm predicts near linear
speedups for reasonable recursion depths, as compared to the 1-level
speedup model which predicts linear speedups at depths greater than 100,
beyond a practical problem size.




Improving Results: Dynamic Processor Assignment 

Clearly with this static processor assignment, problems which
produce unbalanced recursion trees always produced low speedups. Another
version of multilevel tautology has been developed which dynamically
schedules the tautology work among the processors. This algorithm
begins, as the previous algorithm does, by assigning a processor to each
subproblem generated at each level of recursion up to level log_2 N.
However, when a processor is finished, any processor which has available
work will attempt to give part of its problem to it. In this way, large
problems which are highly unbalanced (e.g. c432) can balance the load by
using processors which finish early.
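
The idea of handing work to processors that finish early can be sketched
in Python. Note the substitution: this sketch uses a single shared work
queue that idle workers pull from, which is a simpler technique than the
processor-to-processor handoff described above, and it does not use the
GRAIL package. Threads stand in for processors, and the cover format
(list of cubes, each a dict from variable index to required value) and all
names are assumptions made for illustration.

```python
import queue
import threading


def special_cases(cover):
    if not cover:
        return 0
    if any(len(cube) == 0 for cube in cover):
        return 1
    return -1


def cofactor(cover, var, value):
    return [
        {v: b for v, b in cube.items() if v != var}
        for cube in cover
        if cube.get(var, value) == value
    ]


def dynamic_tautology(cover, n_workers):
    tasks = queue.Queue()
    tasks.put(cover)
    failed = threading.Event()          # set once any subproblem returns 0

    def worker():
        while True:
            item = tasks.get()
            try:
                if item is None:        # shutdown signal
                    return
                if failed.is_set():     # answer already known: drain the queue
                    continue
                r = special_cases(item)
                if r == 0:
                    failed.set()
                elif r == -1:
                    j = min(v for cube in item for v in cube)
                    # both halves go back to the queue for any idle worker
                    tasks.put(cofactor(item, j, 1))
                    tasks.put(cofactor(item, j, 0))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    tasks.join()                        # all subproblems processed
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return 0 if failed.is_set() else 1


all_minterms = [{0: a, 1: b, 2: c} for a in (0, 1) for b in (0, 1) for c in (0, 1)]
balanced_answer = dynamic_tautology(all_minterms, 4)
```

Because children are queued before their parent is marked done, the queue
never appears empty while work remains, so unbalanced trees are spread over
whichever workers are free.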


The multilevel algorithm using dynamic processor scheduling has 
been implemented on the Encore Multi-Max and on the Sun 3 Network. 
The Sun 3 network implementation uses the GRAIL distributed software 
package. 


The results obtained to date are encouraging. On a large problem,
c432, a 36 input, 7 output Boolean network, very good results were
obtained. This problem produces a highly unbalanced recursion tree.
There are 2.2 million leaves in its recursion tree whereas a full tree would
have a leaf count around 69 billion. Table 3 contains the results. When
the dynamic processor scheduling program was run on smaller
examples, the results were not as impressive. In many trivial problems,
degradation in speedups occurred. Since these problems take only a short
amount of time to process (less than a few minutes), we are not too
concerned with these problems. However, it should be noted that even with
a large problem, there is room for improvement. For example, the c432
result using 20 processors on the Encore is below 19. This
less than linear speedup may be caused by a poor scheduling algorithm
which gives too many small problems to available
processors instead of larger ones, so that the overhead reduces the
speedup. Currently we are investigating the possibility of modeling the
processing time of a cover so that the scheduler can better deal out work.
Early experiments suggest that this may be feasible. It is believed that,
with an accurate model of processing time, both the small problems and the
large problems will see significant improvements in speedup with the
dynamic processor scheduling algorithm.
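
The tree sizes quoted for c432 can be checked with a few lines of Python.
The leaf counts are the ones stated in the text; the function names are
hypothetical, and the sketch assumes a full binary tree whose depth equals
the number of inputs.

```python
def full_tree_leaves(inputs: int) -> int:
    """Leaf count of a full binary recursion tree with one level per input."""
    return 2 ** inputs


def fill_fraction(leaves: float, inputs: int) -> float:
    """Fraction of the full tree's leaves that the actual tree contains."""
    return leaves / full_tree_leaves(inputs)


c432_full = full_tree_leaves(36)          # 68,719,476,736: about 69 billion
c432_fill = fill_fraction(2.2e6, 36)      # 2.2 million leaves observed
```

The observed tree holds only a few hundred-thousandths of the leaves of a
full tree, which is why a static assignment leaves most processors idle on
this problem.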

Conclusions and Further Work 


The parallel 1-level and multilevel tautology algorithms can be 
practically applied to both the distributed and multiprocessor computer 
environments. Covers which produce full or nearly full recursion trees 
produce the best speedups and these problems are the most time consum- 
ing for the serial algorithms. 


Both algorithms are designed for a distributed computer environ- 
ment. Each processor is given a substantial portion of work and there is 
not a tremendous amount of communications between processes. From 
the plots of the speedup model it is clear that the bipartitioning along the 
recursive call boundaries cannot achieve linear speedups for practical 
examples in the case of 1-level tautology. It would be interesting to 
investigate the possibilities of parallelizing the routines of the 1-level tau- 
tology algorithm. 


The computer system can have a substantial effect on the speedups
obtained. It was observed that communications, startup and initialization
costs will affect the results. Startup costs degrade speedup for problems
with covers which only recur a few levels. Because of startup costs,
problems which recur only a few levels will see little or no increase in
speedups if the number of processors is increased. Figure 9 shows a plot of
the speedup models for several values of N which shows a region where
processor saturation occurs (between maximum level of recursion 1 and 6).
It is apparent that the startup cost is quite substantial on the Sun 3 network
and non-trivial on the Encore Multi-Max. Data taken from the Intel
hypercube did not include startup costs. Thus these results indicate the
type of improvements that could be obtained by reducing the startup
costs. Research in the area of reducing startup costs, especially on the
Sun 3 network, should be pursued.


The speedup model for tautology has been developed under the
assumption that startup and communications costs are not a function of
the number of processors attempting to communicate simultaneously over
the channel. However, as the number of processors increases, the number
of message collisions increases. The results from the Sun 3 network
presented in this paper were obtained using at most 8 processors and the
effect of message collisions may have been very small. However, if a
larger number of processors is used then this effect could become
significant. It is not clear how well the model will extend as the number
of processors is increased.


Finally, the model was developed in a general manner and should
be easily applicable to recursive divide and conquer algorithms with the
properties given previously (section titled Speedup Model for Divide and
Conquer Algorithms). The model can also be applied to a particular
divide and conquer algorithm which has a subset of inputs which produce
the properties stated above. The speedup model for tautology makes
several assumptions specific to the tautology algorithm and the computer
systems (section titled Speedup Model for Tautology). Alterations to the
communications and startup cost models will be necessary if these
assumptions are not true for the particular recursive divide and conquer
algorithm or the particular computer system.


Our results have shown that almost linear speedups are available for
problems which are of sufficient size and lead to full or effectively full
recursion trees. Examples are the covers of all minterms in the 1-level
case (cf. the asymptote in figure 1), and the adder circuits, other data
path circuits (multipliers, ALUs), and c432 in the multilevel case. We note
that while c432 recurred to the full depth of its number of inputs, it still
had an imbalanced tree, and so required dynamic scheduling to achieve
nearly linear speedup. This scheduling algorithm did not work well,
however, on problems with limited effective recursion level, and we are now
working on improvements to the scheduler which correct this deficiency.


We have observed that dynamic scheduling results obtained from
partially loaded computer systems have produced acceptable run times.
The scheduling algorithm appears to be robust enough to correctly shift
work from the slower/loaded processors to the faster/unloaded processors.
Although efficiency is degraded, the initial results indicate that the
available computing resources of the loaded system are being used
efficiently. Future work includes a further study of the effects of uneven
processor loading, which represents a more practical setting, on the
tautology run times.


We observe that in the context of multilevel logic minimization, [2], 
analogous tautology checking for c432 has been achieved in as little as 5 
seconds, as opposed to 8 hours. Yet, further tautology requests in the 
same context required 5-8 hours. This nonuniformity of resource 
consumption is a formidable obstacle to extending the size capability of 
multilevel minimization. Nevertheless, it does seem that all cases that 
lead to "large" cpu requirements also lead to linear speedups in the 
corresponding parallel implementations. Thus parallel processing does 
appear to offer the hope of directly extending the size limitations of mul- 
tilevel logic minimization algorithms, most of which are based on tautol- 
ogy checking or other divide and conquer type algorithms. 


In fact, the parallel tautology has been implemented in the mul- 
tilevel logic minimizer, MLMIN. Using this parallel MLMIN, we were 
able to finish the c432 problem in 10.1 hours on the Encore Multi-Max
using 16 processors versus 69.1 hours serially.


We have at this time no direct way to compare our results to those 
reported in [9], since that work was done on a different computer, and 
used a different (although still divide and conquer) algorithmic approach. 
Further, that work was designed for a testing and logic verification (rather 
than for a logic minimization) context. Thus it was applicable only to a 
restricted class of Boolean networks (in which the individual functions 
were required to be typical primitive logic gates, such as NAND’s, 
NORs, XORs, etc.). However, it does seem that the general conclusions
stated above are supported by that work as well.
Figure 3
procedure ML_TAUTOLOGY (F, i, N, parent, il)
Inputs:
    F      - is a multilevel cover
    i      - is the current level of recursion
    N      - is the number of processors in the computer system
    parent - is information about the parent process of this process
    il     - Level in the recursion tree that this processor
             made its first call to TAUTOLOGY


Output: Returns a 1 if the cover is a tautology otherwise returns 0


    r <- SPECIAL_CASES(F)
    if (r ≠ -1) return (r)

    j <- SELECT_SPLIT(F)
    F_{x_j} <- COFACTOR(F, x_j)
    F_{x̄_j} <- COFACTOR(F, x̄_j)


    /* Start a new process */

    if (i ≤ (log_2 N) - 1)
        Start new process on another processor
        Send F_{x̄_j} to the new processor


        /* newparent is information of the new process’s parent */
        Call SIMULATE(F_{x̄_j}) on the new processor
        Call ML_TAUTOLOGY(F_{x̄_j}, i+1, N, newparent, i+1)
        F_{x_j} <- SIMULATE(F_{x_j})
        if (ML_TAUTOLOGY(F_{x_j}, i+1, N, parent, il) = 0)
            a. if il ≠ i return (0)
            b. Send a 0 back to the parent process.
        Wait for the child process to send an answer back
            a. if il ≠ i return the answer
            b. Send the answer back to the parent process.


    /* Perform serial tautology */

    else
        F_{x_j} <- SIMULATE(F_{x_j})
        F_{x̄_j} <- SIMULATE(F_{x̄_j})
        if (ML_TAUTOLOGY(F_{x_j}, i+1, N, parent, il) = 0) return (0)
        if (ML_TAUTOLOGY(F_{x̄_j}, i+1, N, parent, il) = 0) return (0)
        return (1)


end procedure
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Figure 2 
procedure ML_TAUTOLOGY (F ) 
Input: F - a multilevel cover 
Output: Returns a 1 if F is a tautology otherwise returns 0 


r < SPECIAL_CASES(F ) 

if (7 4-1) return (7) 

j <= SELECT_SPLIT( F ) 

F’, < COFACTOR( F, x;) 

Fy <- COFACTOR( F, x;) 

F’, <— SIMULATE( F,) 

Fy < SIMULATE( Fy ) 

if ( TAUTOLOGY(F, ) = 0 ) return (0) 
if (TAUTOLOGY(F z, ) = 0) ) return (0) 
return (1) 

end procedure 

Abstract: Applications that use domain expertise often 
require repeated queries of large databases; these queries 
typically involve the determination of attributes through 
inheritance. We present a parallel inheritance algorithm 
using inferential distance that does not require any form of 
network conditioning. It achieves speedup both in the par- 
allel spreading of search activations within a single query 
and in the simultaneous processing of multiple queries. 
We also show that it is possible to significantly improve 
the performance of some specific networks by adding small 
amounts of additional processing capacity. 


1 Introduction 


Applications that use domain expertise — medical di- 
agnosis, mechanical design, and computer assisted instruc- 
tion, for example — often require repeated queries of large 
knowledge bases. In order to provide concise descriptions, 
these knowledge bases are usually organized as taxonomies 
of objects and classes of objects with their properties stored 
as high as possible within the hierarchy. This avoids a sub- 
stantial amount of redundancy but it means that queries 
must be answered using inheritance. Inheritance is the 
process of inferring properties about objects from group 
membership: a prototypical chair has four legs, so we 
can infer that a specific chair has four legs; all mammals 
are warm blooded, so we can infer that dogs are warm 
blooded. The efficient implementation of inheritance is 
critical to performance. 

Inheritance is complicated by exceptions; “Birds fly” 
but “penguins are birds” and “penguins don’t fly.” The 
most successful approach to dealing with exceptions [1] 
uses inferential distance, introduced by Touretzky [2]. With 
this approach, attributes are inherited from their “closest” 
ancestor according to a partial ordering: 


“The essence of this ordering is that an individual 
or class A is ‘nearer’ to class B than to 
class C iff A has an inference path through B 
to C.” [2, p. 12] 


For example, in Figure 1, Opus is nearer to Penguin than 
to Bird because there is a path from Opus to Bird that 
goes through Penguin. 


[Diagram: BIRD (locomotion: fly) is-a ANIMAL; PENGUIN (locomotion: swim) 
is-a BIRD; OPUS is-a PENGUIN; OPUS is-a BIRD.] 


Figure 1: Sample Network. OPUS should inherit the value swim 
for locomotion because swim at node PENGUIN overrides fly at 
node BIRD. 


It was first suggested that inheritance algorithms using 
inferential distance could not be done in parallel [3]. Later, 
Touretzky showed that it could be done on appropriately 
conditioned networks [2] but the conditioning requires n² 
time [1] and must be repeated after each network modi- 
fication. We present a parallel inferential distance algo- 
rithm that does not require network conditioning. It has 
two sources of parallelism: the spreading activation of a 
single query and the simultaneous processing of multiple 
queries. 

In Section 2, we describe the parallel algorithm for an- 
swering a single query and give an experimental analysis 
of the speedup obtained. In Section 3, we extend the al- 
gorithm to permit the simultaneous processing of multiple 
queries and in Section 4, we consider the effects of adding 
additional processing capacity to alleviate bottlenecks at 
frequently accessed nodes. In Section 5, we summarize 
our results. 


2 Single Query, Parallel Inferential Distance Algorithm 


A query to a knowledge base begins with the request 
for an attribute value and ends when all legal values 
satisfying that request have been sent to the originating node. 
Our algorithm assumes an underlying MIMD message passing 
model of computation and does not make use of any 
global state information. Each process contains the 
description of a single concept and communicates with 
processes containing related concepts. We make no 
assumptions about the transit time or the arrival order of 
communications.¹ 

The algorithm has two variants: in the first, processes 
do not have local memory other than that needed to store 
attribute values; in the second, they do have local memory 
available for status information. 


2.1 First Variant: No local memory 


The algorithm is initiated when a request for an attribute 
value arrives at an originating node. If that node 
has a value locally, it is returned; otherwise, the knowledge 
base is searched. As the search progresses, activity spreads 
through the network using PLUS messages meaning “Do 
you have any values?” and MINUS messages meaning “Any 
values that you have are overridden.” Both types of messages 
contain the name of the originating node; PLUS messages 
contain the requested attribute name as well. Answers 
are returned to the originating node using FOUND 
messages meaning “I have a legal value” and IGNORE 
messages meaning “Ignore any values that I’ve given you.” 
FOUND messages contain a located attribute value and the 
name of the node where it was found; IGNORE messages 
contain the name of their source node. 

Each node reacts to incoming messages as follows: 


• If a node without a value for the attribute receives 
  either a PLUS or MINUS message, then a copy of the 
  message is sent to each parent. 


• If a node with a value for the attribute receives a 
  PLUS message, then a FOUND message is sent to the 
  originator and a MINUS message is sent to each parent. 


• If a node with a value for the attribute receives a 
  MINUS message, then an IGNORE message is sent to 
  the originator and a MINUS message is sent to each 
  parent. 


The final answer is computed by the originating node as 
the set of all “found” values minus the set of all “ignored” 
values. The originating node cannot complete its calcu- 
lations until all message activity within the network has 
terminated. In order to detect this, we use a notion of 
conservation of mass: 


¹Touretzky’s model differs in several ways. His algorithm does not 
accumulate the answer at a specific location but leaves it distributed 
throughout the network. In the case where two incomparable values 
are found, our algorithm will report both while his algorithm reports 
only one. In addition, Touretzky assumes an SIMD model of com- 
putation in which the system can detect the cessation of message 
activity. 

A mass is attached to each message. If a node 
sends out n messages in response to a received 
message with mass m, each outgoing message 
is assigned a mass of m/n. Initial requests 
enter the system with a mass of 1. The originator 
can terminate when it has received back 
a total mass of 1. 


A fifth type of message MASS is used to transmit mass from 
a node without ancestors to the originator. 

To see that the algorithm is correct, notice that the 
activation must reach every node on an ancestor path ex- 
tending from the originator: nodes up to and including 
the first node with a value on a path receive PLUS mes- 
sages and the remaining nodes on the path receive MINUS 
messages. If a node is on more than one path, it can re- 
ceive both PLUS and MINUS messages; if it does, the MINUS 
indicates the presence of a value that is on an inference 
path from the originating node. Thus a node with a value 
will receive only PLUS messages in exactly the cases where 
it has a valid value as defined by the inferential distance 
criteria. Further, the algorithm must halt for networks 
without cycles (inheritance hierarchies, by definition, do 
not have cycles): mass is always conserved and all mass is 
eventually returned to the originating node after reaching 
either a node with a value or a node without a parent. 


Example. If the node OPUS in the network shown in Figure 1 
receives a message requesting its mode of locomotion, 
the following activity should result (assuming unit 
time for operations and message transmissions): 


Time Unit 1: The initial message is sent to OPUS re- 
questing his method of locomotion. 


Time Unit 2: OPUS sends <PLUS, .5, locomotion, OPUS> 
to PENGUIN (a PLUS message with mass .5; the orig- 
inator is OPUS) and <PLUS, .5, locomotion, OPUS> 
to BIRD. 


Time Unit 3: PENGUIN sends <FOUND, .25, swim> to 
OPUS and <MINUS, .25, OPUS> to BIRD. BIRD sends 
<FOUND, .25, fly> to OPUS and <MINUS, .25, OPUS> 
to ANIMAL. 


Time Unit 4: BIRD sends <MINUS, .125, OPUS> to 
ANIMAL and <IGNORE, .125, BIRD> to OPUS. ANIMAL 
sends <MASS, .25> to OPUS. 


Time Unit 5: ANIMAL sends <MASS, .125> to OPUS. 


Time Unit 6: OPUS has received back a mass of 1 and 
calculates the answer is swim. 
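
The example above can be replayed with a short simulation. This is a minimal sketch of the first variant under stated assumptions: the function name `inherit`, the dictionary encoding of the network, and the first-in first-out message queue are illustrative choices, not part of the original algorithm description. Because nodes keep no state in this variant, the final answer does not depend on the order in which messages are processed. Exact fractions are used so that the conservation-of-mass test is an equality check.

```python
from collections import deque
from fractions import Fraction


def inherit(parents, values, origin, attribute):
    """Answer one query with the memoryless variant.

    parents: node -> list of parent nodes (is-a links)
    values:  node -> {attribute: value} stored locally at that node
    Returns (set of legal values, number of messages sent).
    """
    if attribute in values.get(origin, {}):
        return {values[origin][attribute]}, 0    # answered locally

    found = {}               # node -> value reported in a FOUND message
    ignored = set()          # nodes named in IGNORE messages
    returned = Fraction(0)   # mass received back by the originator
    sent = 0
    # Each queue entry is (message type, mass, destination node).
    queue = deque([("PLUS", Fraction(1), origin)])

    while queue:
        kind, mass, node = queue.popleft()
        ups = parents.get(node, [])
        if attribute in values.get(node, {}):
            # One reply to the originator plus a MINUS to each parent.
            share = mass / (len(ups) + 1)
            if kind == "PLUS":
                found[node] = values[node][attribute]    # FOUND
            else:
                ignored.add(node)                        # IGNORE
            returned += share
            sent += 1
            for p in ups:
                queue.append(("MINUS", share, p))
                sent += 1
        elif ups:
            # No local value: copy the message to every parent.
            share = mass / len(ups)
            for p in ups:
                queue.append((kind, share, p))
                sent += 1
        else:
            # No value and no parents: return the mass in a MASS message.
            returned += mass
            sent += 1

    assert returned == 1     # conservation of mass: the query has terminated
    answer = {v for n, v in found.items() if n not in ignored}
    return answer, sent


# Network of Figure 1, with the links implied by the example's message trace.
parents = {"OPUS": ["PENGUIN", "BIRD"], "PENGUIN": ["BIRD"],
           "BIRD": ["ANIMAL"], "ANIMAL": []}
values = {"PENGUIN": {"locomotion": "swim"}, "BIRD": {"locomotion": "fly"}}
answer, messages = inherit(parents, values, "OPUS", "locomotion")
print(answer, messages)
```

On this network the sketch reports the single value swim and counts ten messages after the initial request, matching the trace above.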


Spreading activation requires a time proportional to 
the depth of the search tree rather than its size; thus, 
there is more parallelism available in “bushy” networks. 
Defining bushiness to be the average of the number of 
nodes in a search tree divided by its depth, the knowledge 
bases we have investigated have a bushiness value between 
2 and 3. The speedup achieved for the overall processing 
of a query is further limited by the need to accumulate 
the final answer in the originating node. While some of 
the accumulation can overlap with the search itself, the 
originating node is potentially quite busy, receiving mes- 
sages from every node on an ancestor path that either has 
a value or has no parents. 

Before presenting our experimental results, we describe 
the second variant of our algorithm in which message traf- 
fic (including that directed to the originating node), is 
reduced with the use of local memory. 


2.2 Second Variant: Local memory 


For this version, we add a single memory cell to each 
node that keeps track of the history of that node’s par- 
ticipation in a query. The cell can record one of three 
status values indicating that (1) neither PLUS nor MINUS 
messages have been received, (2) PLUS but not MINUS mes- 
sages have been received, or (3) MINUS messages have been 
received. The use of this information can reduce message 
traffic. Nodes with values that receive a second message 
need only send their mass back to the originator (possibly 
in an IGNORE message). Nodes without values do the same 
unless they receive a PLUS followed by a MINUS in which 
case the MINUS must still be propagated forward. If used 
for the query described above, for example, this algorithm 
would decrease the number of time units needed from 6 
to 5 and the number of messages from 10 to 8. 
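
The status-cell rules can be sketched in the same style. This is one possible reading of the second variant, not a definitive implementation: the three status names, the function name `inherit_with_memory`, and the first-in first-out processing order (standing in for unit-time message delivery) are assumptions made for illustration.

```python
from collections import deque
from fractions import Fraction

NONE, PLUS, MINUS = "NONE", "PLUS", "MINUS"


def inherit_with_memory(parents, values, origin, attribute):
    """Second variant: one status cell per node. Returns (answer, messages)."""
    if attribute in values.get(origin, {}):
        return {values[origin][attribute]}, 0

    status = {}                       # node -> NONE / PLUS / MINUS
    found, ignored = {}, set()
    returned, sent = Fraction(0), 0
    queue = deque([(PLUS, Fraction(1), origin)])

    while queue:
        kind, mass, node = queue.popleft()
        ups = parents.get(node, [])
        has_value = attribute in values.get(node, {})
        seen = status.get(node, NONE)
        first = seen == NONE                      # first message of this query
        upgrade = seen == PLUS and kind == MINUS  # PLUS followed by MINUS
        if first or upgrade:
            status[node] = kind

        if has_value and first:
            # Same behaviour as the memoryless variant.
            share = mass / (len(ups) + 1)
            if kind == PLUS:
                found[node] = values[node][attribute]   # FOUND
            else:
                ignored.add(node)                       # IGNORE
            returned += share
            sent += 1
            for p in ups:
                queue.append((MINUS, share, p))
                sent += 1
        elif has_value and upgrade:
            # Parents were already sent MINUS: return all mass in an IGNORE.
            ignored.add(node)
            returned += mass
            sent += 1
        elif not has_value and (first or upgrade) and ups:
            # New information for the ancestors: propagate it.
            share = mass / len(ups)
            for p in ups:
                queue.append((kind, share, p))
                sent += 1
        else:
            # Nothing new to report, or no parents: return the mass (MASS).
            returned += mass
            sent += 1

    assert returned == 1
    return {v for n, v in found.items() if n not in ignored}, sent


parents = {"OPUS": ["PENGUIN", "BIRD"], "PENGUIN": ["BIRD"],
           "BIRD": ["ANIMAL"], "ANIMAL": []}
values = {"PENGUIN": {"locomotion": "swim"}, "BIRD": {"locomotion": "fly"}}
answer, messages = inherit_with_memory(parents, values, "OPUS", "locomotion")
print(answer, messages)
```

For the OPUS query this reading sends eight messages instead of ten, because BIRD answers the second message it receives with a single IGNORE rather than forwarding another MINUS to ANIMAL.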


We have simulated the performance of both variants 
of our inferential distance algorithm on three knowledge 
bases. The first, DATATYPE1, is an 80 node data type 
lattice. DATATYPE1 has a single node that is the ancestor 
of every node in the network. Since this is not typical, 
our second network, DATATYPE2, is the 79 node network 
that resulted from removing DATATYPE1’s top node; it 
has five top nodes. Our third network, CALL, is a 569 
node call graph that has been randomly seeded with val- 
ues. 

The results of our simulations are shown in Figure 2. 
We assumed that processes could respond to incoming 
messages in unit time and that direct connections were 
available to all processes for outgoing messages; we did 
not charge for communication overhead. Thus, our re- 
sults do not measure the parallelism that can be realized 
on a specific machine, but instead measure the parallelism 
inherent in the algorithm. 

The table shows the average time needed to answer a 
query and the average number of messages sent. It can 
be seen that the algorithm’s performance is improved by 
memory. Assuming the running time of the best sequen- 
tial algorithm for inferential distance is approximately the 
same as the number of messages sent using the second 

                      Size   Bushiness   Average   Average 
                                         Time      Messages 
DATATYPE1              80      2.0 
    without memory                        8.33      19.45 
    with memory                           6.90      12.94 
DATATYPE2              79      2.5 
    without memory                        7.53      14.96 
    with memory                           6.35      10.81 
CALL                  569      2.6 
    without memory                        8.51      21.06 
    with memory                           7.47      15.06 


Figure 2: The effect of adding memory to the Parallel Inferential 
Distance Algorithm. 


variant, we find that our parallel algorithm reduced the 
response time for single queries roughly by a factor of 2. 
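
As a rough check of this estimate, the ratio can be worked out for the two data type networks from the "with memory" rows of Figure 2, taking the average message count as the sequential running time (the assumption stated above):

```latex
\[
\text{DATATYPE1: } \frac{12.94}{6.90} \approx 1.88,
\qquad
\text{DATATYPE2: } \frac{10.81}{6.35} \approx 1.70 .
\]
```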


3 Multiple Query Algorithm 


For many applications, it is necessary to provide si- 
multaneous access to a knowledge base for more than one 
user. Sharing a single copy of the knowledge base avoids 
the problems of maintenance and coherence and provides 
higher utilization. In this section, we demonstrate that 
our algorithm permits the simultaneous processing of mul- 
tiple queries without significantly affecting individual re- 
sponse times. 

Multiprocessing of queries requires a few modifications. 
Messages are tagged with a query id allowing processes to 
distinguish messages relating to different queries and orig- 
inating nodes are required to keep track of a number of 
sets of partial results. In addition, when the local mem- 
ory variant is used, all nodes are required to store status 
information for the active queries. Since processes have 
fixed size memory (cells for status information) and are 
not notified that a query has completed, memory is managed 
with a Least Recently Used policy. If this results in 
the destruction of information before the associated query 
completes, the algorithm performs correctly but may ex- 
ecute redundant computations. 
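
The fixed-size status memory can be pictured as a small least-recently-used table keyed by query id. The sketch below is illustrative only: the class name `StatusMemory`, its methods, and the use of the string "NONE" for a forgotten query are assumptions, not details given in the paper.

```python
from collections import OrderedDict


class StatusMemory:
    """A node's status cells, shared among active queries under an LRU policy."""

    def __init__(self, cells):
        self.cells = cells            # number of status cells at this node
        self.table = OrderedDict()    # query id -> status, oldest first

    def get(self, query_id):
        """Return the recorded status, or "NONE" if it was never stored or evicted."""
        if query_id in self.table:
            self.table.move_to_end(query_id)    # mark as most recently used
            return self.table[query_id]
        return "NONE"

    def put(self, query_id, status):
        """Record a status, evicting the least recently used entry if full."""
        self.table[query_id] = status
        self.table.move_to_end(query_id)
        if len(self.table) > self.cells:
            self.table.popitem(last=False)


memory = StatusMemory(cells=2)
memory.put("q1", "PLUS")
memory.put("q2", "MINUS")
memory.get("q1")             # q1 becomes the most recently used entry
memory.put("q3", "PLUS")     # evicts q2
print(memory.get("q2"), memory.get("q1"))
```

A node that finds "NONE" for an evicted query simply treats the next message as the first one it has seen, which is the redundant work described above.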

We have experimentally evaluated the processing of 
multiple queries on the three knowledge bases. We used a 
number of different memory sizes. Each experiment was 
performed with a fixed memory size, varying the arrival 
rate of queries. Queries were randomly selected from the 
set of possible queries and their arrival was controlled by 
a Poisson process. The effect of increased arrival rates on 
response time was observed. 
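
A query workload of this kind can be generated as follows. This is a minimal sketch, assuming exponentially distributed inter-arrival times (the defining property of a Poisson process) and uniform random selection of queries; the function name `poisson_arrivals` and its parameters are hypothetical.

```python
import random


def poisson_arrivals(rate, count, queries, seed=0):
    """Return `count` (arrival time, query) pairs for a Poisson process.

    rate:    mean number of arrivals per unit time
    queries: the set of possible queries to draw from
    """
    rng = random.Random(seed)
    clock = 0.0
    schedule = []
    for _ in range(count):
        clock += rng.expovariate(rate)      # exponential gap between arrivals
        schedule.append((clock, rng.choice(queries)))
    return schedule


schedule = poisson_arrivals(rate=2.0, count=50,
                            queries=["q1", "q2", "q3"])
print(schedule[:3])
```

Raising `rate` while holding everything else fixed corresponds to moving right along the arrival-rate axis of the experiments.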


Figures 3, 4, and 5 show the results of the experiment 
for the three knowledge bases. In each case, the y-axis is 
the average time to answer a query and the x-axis is the 
arrival rate normalized with respect to the rate at which 

Figure 3: Multiprocessing of queries on the DATATYPE1 
knowledge base. 


isolated queries are processed. Thus, for example, with 
a memory size of 2, 3 or 5, multiprocessing allows the 
arrival rate to increase substantially before response time 
is affected. 

Figure 4: Multiprocessing of queries on the DATATYPE2 
knowledge base. Axes labeled as in Figure 3. 


As the arrival rate increases, the average time to an- 
swer a query increases. Eventually, the network reaches a 
saturation point at which incoming messages arrive faster 
than they can be processed and the average time to answer 
a query skyrockets. If we consider the (normalized) arrival 
rate just before saturation is reached as the maximum arrival 
rate, we see that the three data bases achieve maxi- 

Figure 5: Multiprocessing of queries on the CALL knowledge 
base. Axes labeled as in Figure 3. 


mums of 2.5, 4.5, and 13.5. The amount of improvement 
available depends on the structure of the network. The 
effect of increasing memory is twofold: the average time 
to answer a query is lowered and the maximum arrival 
rate is increased. Only small amounts of local memory 
are needed; a memory size of 2 or 3 appears sufficient. 

Examination of the simulation data showed that the presence 
of a few bottlenecks significantly degraded performance. 
In the DATATYPE1 knowledge base, a single 
node was an ancestor to all nodes and acted as a serial 
bottleneck. In the next section, we explore the possibility 
of improving throughput by adding processing capacity at 
heavily utilized nodes. 


4 Utilizing Additional Processing 
Capacity 


If additional processing nodes are available for use, it is 
possible to alleviate congestion at heavily utilized nodes. 
As shown in Figure 6, heavily utilized nodes can be copied, 
partitioning their inputs. In the figure, C is copied so that 
queries originating at D and E no longer compete for its 
processing capacity with queries originating at F. 

To investigate the effect of additional processing ca- 
pacity, we used a simple processor allocation scheme. The 
system was run at its maximum arrival rate (just before 
saturation) to determine which node had the highest av- 
erage input queue length. This node was duplicated and 
its original inputs were partitioned so that the sums of the 
average utilizations of the nodes for each partition were as 
close to equal as possible. Figure 7 shows the results for 

Figure 6: Example of node duplication used to reduce the 
processing requirements at heavily utilized nodes. 

Figure 7: Speedup achieved by adding processing capacity to 
the DATATYPE1 network. The y-axis gives the maximum normalized 
arrival rate and the x-axis gives the number of processors. 


The figure shows the increase in maximum arrival rates 
as a function of the number of nodes in the network for 
varying amounts of memory. The initial additions achieve 
substantial speedups. Adding a single node, for exam- 
ple, increases the throughput almost two-fold; adding five 
nodes increases the throughput almost four-fold. This is 
because initially there were a few very heavily utilized 
nodes in the knowledge base. As more and more proces- 
sors are added, the utilization of the nodes in the system 
becomes more uniform and the speedup becomes linear. 
Note, however, that this speedup reflects the potential 
parallelism available in the algorithm; it does not take 
into account the communication overhead that would be 

²We present data for only one network. The results for 
DATATYPE2 are identical after the first few splits; the results for 
CALL show similar improvements. 

present in a specific implementation. As the machine size 
increases, the effect of this overhead increases and, at some 
point, the performance of an actual system would degrade. 
Thus, we can divide the performance graph into three re- 
gions. Performance gains should be accomplished, in the 
first region, by expanding the existing network and, in the 
last region, by copying the entire network; performance 
gains in the middle region can be accomplished in either 
manner. 
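
The input-partitioning step used when a node is duplicated can be sketched as a small balanced-partition search. The paper states only that the two partitions should have utilization sums as close to equal as possible; the exhaustive search below, the function name `split_inputs`, and the example loads are assumptions chosen for illustration, and the search is practical only for nodes with a modest number of inputs.

```python
from itertools import combinations


def split_inputs(load):
    """Split a node's inputs into two groups with nearly equal total load.

    load: input node -> average utilization it contributes
    Returns (group_a, group_b) as sets of input nodes.
    """
    children = sorted(load)
    if len(children) < 2:
        raise ValueError("need at least two inputs to partition")
    total = sum(load.values())
    best_gap, best_group = None, None
    # Only groups up to half the inputs need checking; the rest are complements.
    for size in range(1, len(children) // 2 + 1):
        for group in combinations(children, size):
            gap = abs(total - 2 * sum(load[c] for c in group))
            if best_gap is None or gap < best_gap:
                best_gap, best_group = gap, set(group)
    return best_group, set(children) - best_group


# Hypothetical loads (in percent) for the three inputs of node C in Figure 6.
group_a, group_b = split_inputs({"D": 30, "E": 20, "F": 50})
print(group_a, group_b)
```

With these made-up loads the inputs D and E end up on one copy and F on the other, which is the arrangement drawn in Figure 6.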


5 Conclusions 


We have presented a MIMD algorithm for inferential 
distance inheritance that does not require network condi- 
tioning. It achieves speedup both in the parallel spread- 
ing of search activations and in the multiprocessing of 
queries. In each case, the parallelism available depends 
on the structure of the knowledge base. We have shown 
that it is often possible to significantly improve the per- 
formance of a specific network by adding small amounts 
of additional processing capacity. 

This work analyzes the parallelism available in our in- 
ferential distance algorithm; it does not analyze the per- 
formance of the algorithm on a real machine. Any actual 
implementation will be limited by the overhead of com- 
munication. We are currently developing strategies for 
allocating processes on specific architectures to determine 
the extent to which the available parallelism can be ex- 
ploited in practice. 


References 


[1] David W. Etherington, “More on Inheritance Hierarchies 
with Exceptions,” Proc. AAAI-87, pp. 352-357 
(1987). 


[2] David S. Touretzky, The Mathematics of Inheritance 
Systems. Morgan Kaufmann Publishers, Inc., Los Altos, 
California (1986). 


[3] David W. Etherington and Raymond Reiter, “On Inheritance 
Hierarchies with Exceptions,” Proc. AAAI-83, 
pp. 104-108 (1983). 


ASSOCIATIVE MEMORIES ON THE CONNECTION MACHINE 
Stephen D. Simmes and Charles J. Turner 

Science Applications International Corporation 

5151 E. Broadway, Suite 900 

Tucson, Arizona 85711-3796 


Abstract 


We study the Hopfield associative memory on 
a Connection Machine. We derive a synchronous 
algorithm that allows up to 1600 memory patterns, 
each 16K in size, on a 16K processor machine. The 
memory recall time for such a setup is estimated to 
be approximately 8 minutes on a CM-1. An 
expression for the asynchronous energy change 
during an iteration is derived which shows that 
the diagonal terms in the synaptic coupling matrix 
have an effect on the memory recall process. These 
terms are generally set to zero in studies of 
associative memory. We show that this may not be 
the best strategy. 


1.0 Introduction 


There has been much recent interest in 
computer algorithms which model or emulate 
biological systems. Much of this work has appeared 
under the appellation “neural networks,” which 
generally refers to large networks of 
communicating simple processors. Such systems 
perform computations exhibiting qualities of living 
organisms such as learning, recognition, or 
interaction with a complex environment. One such 
algorithm which has attracted a wide base of 
interest is the associative memory as formulated by 
Hopfield (1982). 


The Hopfield associative memory consists of a 
fully interconnected network of N simple processors 
which are capable of computing a linear 
combination of scalar input values along with a 
nonlinear thresholding operation. Each processor 
can be in one of two states, $S_i \in \{-1, 1\}$, and the state of 
the entire network is represented by an N-dimensional 
vector $S = (S_1, S_2, \ldots, S_N)$. The memory 
works according to the following dynamic updating 
algorithm: Each processor receives the current 
state values from all other processors which are 
multiplied by synaptic weights and summed to give 
the net input. The processor state value is then 
updated according to the sign of the net input minus 
a threshold. Two modes of processing are of 
interest, synchronous and asynchronous. In the 
asynchronous mode, the processors are updated 
without global timing whereas in the synchronous 
mode all processors are updated in unison. The two 
modes are not equivalent and may lead to different 
limiting behavior from identical initial states. 


If the synaptic weights are chosen 
appropriately the system is capable of exhibiting a 
memory property. In this case the weights store 
fundamental memory patterns $S^{(k)}$, $k = 1, \ldots, M$, and 
when M is not too large, each memory $S^{(k)}$ is a 
stable attractor for the dynamic updating algorithm. 
That is, if the system is initialized with a pattern in a 
sufficiently small neighborhood of one of the 
fundamental memories, the updating algorithm will 
converge to this memory as time increases. Thus, 
the system associates the stored memory with the 
noisy input version when the noise level (measured 
by the number of incorrect bits in the pattern) is 
not too large. This is a useful error correcting 
property in systems where the next stage of 
processing is sensitive to noise in the patterns. 
2.0 Associative Memory Algorithm 

The synaptic weight for the path connecting 
processor i to processor j is denoted by $J_{ij}$. It is 
assumed that $J_{ij} = J_{ji}$ so that the matrix of weights J is 
symmetric. The dynamic updating rule for zero 
threshold may then be expressed 

$$S_i(t+1) = \mathrm{sign}\left( \sum_{j=1}^{N} J_{ij} S_j(t) \right) \qquad (1)$$

where sign(x) = +1 if x ≥ 0 and -1 if x < 0. The state of 
the system at t = 0, S(0), is set to the noisy pattern. 
Equation (1) is then iterated until a stable behavior 
is realized. 


When the i-th processor is changed 
independently of the other processors, the global 
energy 

$$E(S) = -\frac{1}{2} \sum_{i,j=1}^{N} J_{ij} S_i S_j$$

is non-increasing whenever $J_{ii} \ge 0$. This is seen by 
the following computation. If $\Delta E^{(i)}$ denotes the 
change in energy when the i-th state changes by 
$\Delta S_i$ then 

$$\Delta E^{(i)} = -\Delta S_i \left[ \frac{1}{2} J_{ii} \Delta S_i + \sum_{j=1}^{N} J_{ij} S_j \right].$$

Let $V_i(t) = \sum_{j=1}^{N} J_{ij} S_j(t)$ be the input to the i-th processor; 
then $V_i(t) = S_i(t+1)\,|V_i(t)|$ and, since 
$\Delta S_i = S_i(t+1) - S_i(t)$ gives $(\Delta S_i)^2 = 2\,(1 - S_i(t) S_i(t+1))$, 
the change in energy can be written 

$$\Delta E^{(i)} = -\left( J_{ii} + |V_i(t)| \right) \left( 1 - S_i(t)\, S_i(t+1) \right) \le 0 \quad \text{if } J_{ii} \ge 0. \qquad (2)$$

This implies that the asynchronous updating 
algorithm always converges to a steady state. In the 
synchronous mode, it is no longer true that the 
energy must decrease at every time step and the 
system converges to either a steady state or a 
periodic orbit. 
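
The energy argument can be checked numerically on a small network. The sketch below is an illustration under assumptions, not part of the original experiments: it draws a random symmetric integer weight matrix with a non-negative diagonal, applies single-processor updates in random order, and compares each observed energy change with the closed form of equation (2). The function names and the network size are arbitrary.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def energy(J, S):
    """Global energy E(S) = -(1/2) * sum over i, j of J[i][j] * S[i] * S[j]."""
    n = len(S)
    return -0.5 * sum(J[i][j] * S[i] * S[j] for i in range(n) for j in range(n))


def asynchronous_run(n=8, steps=200, seed=1):
    """Return the energy change observed at each single-processor update."""
    rng = random.Random(seed)
    J = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            w = rng.randint(-3, 3)
            # Symmetric weights; diagonal entries forced to be non-negative.
            J[i][j] = J[j][i] = abs(w) if i == j else w
    S = [rng.choice((-1, 1)) for _ in range(n)]

    changes = []
    for _ in range(steps):
        i = rng.randrange(n)
        V = sum(J[i][j] * S[j] for j in range(n))            # input to processor i
        new = sign(V)
        predicted = -(J[i][i] + abs(V)) * (1 - S[i] * new)   # equation (2)
        before = energy(J, S)
        S[i] = new
        observed = energy(J, S) - before
        assert observed == predicted
        changes.append(observed)
    return changes


changes = asynchronous_run()
print(max(changes) <= 0)
```

Integer weights keep the arithmetic exact, so the observed and predicted changes can be compared with equality rather than a tolerance.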

Consider the case where the fundamental 
memories are orthogonal vectors, that is 

$$\sum_{i=1}^{N} S_i^{(k)} S_i^{(l)} = N \delta_{kl}.$$

In this case the weights may be defined by the 
correlation matrix 

$$J_{ij} = \sum_{k=1}^{M} S_i^{(k)} S_j^{(k)}. \qquad (3)$$

This is the formulation used by Hopfield (1982) 
except we have put no restriction on $J_{ii}$ (Hopfield 
assumed $J_{ii} = 0$). Since $J_{ii} = M$ in (3) independent of 
the set of memorized patterns, it is clear that the 
diagonal terms play no role in the storage of 
information. However, from equation (2) we 
conclude that the diagonal terms do play some role 
relating to the stability of the patterns. This 
observation is reinforced by the fact that as $J_{ii} > 0$ 
increases without bound (holding the off diagonal 
terms fixed), equation (1) predicts 
$S_i(t+1) = \mathrm{sign}(J_{ii} S_i(t)) = S_i(t)$ and thus all patterns become stable. 
On the other hand, as $J_{ii} < 0$ decreases without bound, 
it follows that $S_i(t+1) = \mathrm{sign}(J_{ii} S_i(t)) = -S_i(t)$ and every 
pattern destabilizes into a flip-flop. 
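
The two limiting cases are easy to see on a toy network. In this sketch the 3-by-3 weight matrix, the state, and the diagonal values of ±100 are arbitrary illustrative choices; only the update rule is taken from equation (1).

```python
def sign(x):
    return 1 if x >= 0 else -1


def sync_step(J, S):
    """One synchronous application of equation (1) to every processor."""
    n = len(S)
    return [sign(sum(J[i][j] * S[j] for j in range(n))) for i in range(n)]


def with_diagonal(J, d):
    """Copy of J with every diagonal entry replaced by d."""
    n = len(J)
    return [[d if i == j else J[i][j] for j in range(n)] for i in range(n)]


J = [[0, 1, -1],
     [1, 0, 1],
     [-1, 1, 0]]
S = [1, -1, 1]

frozen = sync_step(with_diagonal(J, 100), S)     # large positive diagonal
flipped = sync_step(with_diagonal(J, -100), S)   # large negative diagonal
print(frozen, flipped)
```

With the large positive diagonal the state is returned unchanged; with the large negative diagonal every component changes sign, which repeated would give the flip-flop described above.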


Several recent papers have studied the 
question of stability of fundamental memories 
(McEliece et al. (1987), Newman (1987) and Komlos 
and Paturi (1987)). These authors study the 
correlation matrix memory using probabilistic 
methods when the fundamental memories are 
random vectors. The following results have been 
proved by Komlos and Paturi (1987); $\alpha_a$, $\alpha_s$, $\rho_a$, $\rho_s$ are 
positive constants and distance between patterns is 
measured by the Hamming distance (number of bits 
which differ). 

Asynchronous updating: If $M < \alpha_a N$ and the 
initial pattern is within distance $\rho_a N$ from a 
fundamental memory, then the system will 
converge to a steady state within a distance 
$N \exp(-N/4M)$ from the fundamental memory. 
When $M < N/(4 \ln N)$ the system will converge to 
the fundamental memory. 


Synchronous updating: If $M < \alpha_s N$ and the 
initial pattern is within distance $\rho_s N$ from a 
fundamental memory, then in about $\ln(N/M)$ 
steps the system will enter a region of radius 
$N \exp(-N/4M)$ about the fundamental memory 
and remain there. When $M < N/(4 \ln N)$ the 
system will converge to the fundamental 
memory in $O(\ln \ln N)$ time steps. 


By means of numerical experiments, Hopfield (1982) 
showed for asynchronous updating that $\alpha_a < 0.15$. 
Later, Amit, Gutfreund and Sompolinsky (1985) 
extended the analysis to a nonzero temperature (T) 
Monte Carlo process (the Hopfield dynamics 
correspond to T=0). They gave convincing 
arguments that there was a critical value $\alpha_a = 0.138$ 
above which no reliable memory was possible. A 
restriction on M is natural since when M gets too 
large the fundamental memories crowd one another 
and the basins of attraction begin to coalesce. In 
fact Komlos and Paturi (1987) have shown that there 
are $O(2^M)$ stable extraneous memories stored by the 
correlation matrix memory, equation (3). 
3.0 Implementation on the Connection Machine 

The Connection Machine (CM) is a system of 
64K processors interconnected by a 16 dimensional 
hypercube network. The network is implemented as 
a 12 dimensional wired hypercube with 16 
processors at every node. Each processor is a simple 
bit-serial machine with 4K bits of local memory 
(CM-1). The CM is controlled by instructions sent 
from a front-end computer. Each instruction is 
broadcast to all processors in parallel making the 
CM a synchronous machine. However, an 
instruction is executed only if a processor's context 
flag is set. Thus the context flag may be used to 
control the pattern of computation. Further 
information on the CM may be found in Hillis (1985). 


It is natural to implement the updating rule 
of equation (1) by assigning each state component 
$S_i$ to a processor in a one-to-one fashion. All 
processors then simultaneously execute a time step 
by computing the sign of the local input 

$$V_i(t) = \sum_{j=1}^{N} J_{ij} S_j(t).$$

However, this method requires that the i-th 
processor have access to the N weights $J_{ij}$, $j = 1, \ldots, N$. 
Since these weights differ from processor to 
processor, there is no scheme by which they can be 
broadcast to all processors in N parallel steps. On 
the other hand, if these weights are stored in the 4K 
bits of local memory as 16-bit integers, then N would 
be restricted to a maximum of 256 processors, which 
is unacceptable. (The situation is actually worse 
than this since the CM uses local memory for 
dynamic stack allocation.) A similar limitation 
arises if one attempts to assign weights to 
processors. 


This problem does not arise when the 
correlation memory is incorporated directly into the 
computation. Using equation (3) it is possible to 
rewrite equation (1) as 

$$S_i(t+1) = \mathrm{sign}\left( \sum_{k=1}^{M} W^{(k)}(t)\, S_i^{(k)} \right), \qquad (4)$$

$$W^{(k)}(t) = S(t) \cdot S^{(k)} = \sum_{j=1}^{N} S_j(t)\, S_j^{(k)}.$$

Here the updating rule appears as a linear 
combination of fundamental memories weighted by 
the inner product of the current state with the 
fundamental memory. It is clear that this process 
will converge to a fundamental memory $S^{(k_0)}$ 
whenever the fundamental memories are (more or 
less) orthogonal and the initial pattern S(0) is close 
enough to $S^{(k_0)}$. In this case we find that $W^{(k_0)} = N$ 
and $W^{(k)} = 0$, $k \ne k_0$, so that $S(1) = \mathrm{sign}(N S^{(k_0)}) = S^{(k_0)}$ 
and the algorithm converges in one time step. 


Each processor has the responsibility to keep 
track of one state component. Instead of storing 
weights, however, we now store the fundamental 
memories distributed across the processors so that 
processor i stores all memory components $S_i^{(k)}$, 
$k = 1, \ldots, M$. The same inner products $W^{(k)}(t)$, $k = 1, \ldots, M$, 
are used in (4) for every memory component. The 
elements of the k-th inner product are computed by 
a simple one bit parallel multiplication $S_i(t)\, S_i^{(k)}$. 
These results are then summed over all processors 
and rebroadcast using CM scan operations (Blelloch 
(1986)). Each processor then multiplies the k-th 
inner product times the k-th fundamental memory 
component and updates the sum appearing in 
equation (4). This process is repeated until k=M 
when the sign operation on the sum is executed. All 
memories are represented internally as binary 
numbers ($S_i$ = 0 or 1). This formulation uses about 2K 
bits of local memory for stack space, which leaves 
an upper bound of about M=2K memories in a 
network of N=64K processors. This corresponds to a 
capacity of $\alpha = M/N < 0.0312$ which is acceptable. 
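
The equivalence between the weight-matrix form and the inner-product form can be checked on a small serial example. This sketch is not the Connection Machine implementation: it uses plain Python lists in place of processors, ±1 values rather than the one-bit internal encoding, and function names invented for illustration. It shows only that equation (4) yields the same next state as equation (1) with the weights of equation (3), while storing M·N pattern components instead of N² weights.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def step_with_weights(memories, S):
    """Equation (1), with the correlation matrix of equation (3) built explicitly."""
    n = len(S)
    J = [[sum(m[i] * m[j] for m in memories) for j in range(n)]
         for i in range(n)]
    return [sign(sum(J[i][j] * S[j] for j in range(n))) for i in range(n)]


def step_with_inner_products(memories, S):
    """Equation (4): M inner products, then one weighted sum per component."""
    # W[k] is the inner product of the current state with memory k.
    W = [sum(s * c for s, c in zip(S, m)) for m in memories]
    return [sign(sum(w * m[i] for w, m in zip(W, memories)))
            for i in range(len(S))]


rng = random.Random(0)
N, M = 32, 5
memories = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(M)]
S = [rng.choice((-1, 1)) for _ in range(N)]
same = step_with_weights(memories, S) == step_with_inner_products(memories, S)
print(same)
```

In the parallel setting each list index plays the role of one processor, and the computation of each `W[k]` corresponds to the sum-and-rebroadcast step described above.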
4.0 Numerical Experiments 

Numerical experiments with the associative 
memory were run on a 16K processor Connection 
Machine (CM-1) located at Science Applications 
International Corporation, Tucson, Arizona. The 
fundamental memories consisted of 128x128 pixel 
images. A small number of the images, $M_1 = 27$, were 
selected from the USC Image Processing Institute 
data base (Schmidt (1977)). The original USC data, 
which is eight bit, was mapped into binary format 
by either a dithering process or a thresholding 
operation, depending on the input scene. The 
dithering algorithm, which creates a halftone 
image similar to reproductions of pictures in daily 
newspapers, was found to work well on uncluttered 
high resolution images containing specific objects 
like faces, airplanes, etc. On the other hand, 
thresholding was found to work better on complex 
scenes like aerial photos, although the resulting 
quality was only marginal on such scenes. No 
attempt was made to optimize the visual quality of 
the fundamental memories. 


The full memory capacity for 16K processors
is roughly 1600 images. Since our image data base is
nowhere near this large, a set of randomly
generated 128x128 patterns was used to simulate the
remaining $M-M_I$ fundamental memories. These
patterns were generated by assigning pixel values
-1 or +1 with equal probability. The patterns were
not actually stored on disk but were created in the
CM during the memory initialization phase. The
fundamental memories were all stored and
referenced in local memory as one-bit binary data
via the transformation {-1,+1} → {0,1}.


A single recall experiment consisted of the
following steps. One of the fundamental memories,
$k=k_0$, was chosen as the test image. A noisy copy of
the test image was made by flipping pixel values
with probability $P_N < 0.5$. The iteration was then
initialized by loading the noisy copy into $S(0)$. The
iterations were monitored by computing the
distance between the current state $S(t)$ and the test
memory $S^{(k_0)}$ as well as the distance between
successive states $S(t)$ and $S(t-1)$. The distances were
computed using the metric in equation (6), which
gives the fraction of pixels that differ between
two images.


$$d(S_1,S_2) = \frac{1}{2N} \sum_{j=1}^{N} \left| S_{1j} - S_{2j} \right| \qquad (6)$$

Convergence was decided when the distance
between two successive iterations fell below a
threshold.
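
As an illustration of the monitoring step, the following sketch computes the fraction of differing pixels and builds a noisy copy of a pattern. It assumes images are flat lists of -1/+1 values; `distance`, `noisy_copy`, and the fixed random seed are hypothetical choices made for this example.

```python
import random

def distance(s1, s2):
    # Fraction of positions at which two equal-length images differ.
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def noisy_copy(image, p, rng):
    # Flip each -1/+1 pixel independently with probability p.
    return [-x if rng.random() < p else x for x in image]

rng = random.Random(0)            # seeded only to make the sketch repeatable
image = [1, -1] * 8
noisy = noisy_copy(image, 0.3, rng)
error = distance(image, noisy)    # a value between 0 and 1
```

For -1/+1 pixels, counting mismatches is the same as summing |a - b| over all pixels and dividing by 2N, since each mismatch contributes exactly 2 to that sum.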
Figures 1 and 2 show the results of recall
experiments for M=800 fundamental memory
patterns, corresponding to $\alpha=0.0488$. The noisy
inputs were generated with noise level $P_N=0.3$, so
that approximately 30% of the pixels were flipped
from their correct value. The iterations were
carried out until successive iterations differed by at
most two pixels. Convergence typically occurred in
about six iterations. It was found that the updating
algorithm was able to process about 20 memory
patterns per second. This gives a cycle time of about
40 seconds for an iteration and 4 minutes for
location of the correct memory pattern in the M=800
memory experiments.
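
The timing figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative; the numbers are the ones stated in the text.

```python
patterns_per_second = 20      # reported update throughput
M = 800                       # fundamental memories in the experiment
N = 128 * 128                 # one processor per pixel (16K)
iterations = 6                # typical number of iterations to converge

seconds_per_iteration = M / patterns_per_second        # 40.0 s
total_minutes = iterations * seconds_per_iteration / 60  # 4.0 min
alpha = M / N                                          # about 0.0488
```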


We tested the hypothesis that the diagonal
weights $J_{jj}$ have a specific effect on the
convergence of solutions of equation (1) (or
equation (4)). In order to directly control the
diagonal terms, equation (4) was replaced by
equation (7).


$$S_j(t+1) = \mathrm{sign}\left[ \sum_{k=1}^{M} w^{(k)}(t)\,\xi_j^{(k)} - (M-\delta)\,S_j(t) \right] \qquad (7)$$


where the diagonal terms now are forced to take on
the constant value $J_{jj}=\delta$. Setting $\delta=M$ regains the
weights defined by equation (3), whereas $\delta=0$ gives
the Hopfield (1982) formulation $J_{jj}=0$. The inner
products $w^{(k)}$ are still given by equation (5). It is
interesting to note that for $\delta<M$, the effect of the
new term is a tendency to switch the sign of each
$S_j(t)$. This tendency depends on the magnitude of
the $w^{(k)}(t)$ terms and is weak when $M \ll N$. Several
experiments were run with a fixed noisy input
pattern and several values of $\delta$ ranging from -M to
+M. The results of these runs are summarized in
Figures 3 and 4. As can be seen from these plots,
there is a clear indication, for sufficiently large M,
that as the diagonal terms increase from -M to +M,
the convergence of the algorithm improves
significantly.
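
To make the role of the diagonal term concrete, the sketch below implements the update of equation (7) and compares it with a reference version that builds the weights explicitly with diagonal value delta. It assumes -1/+1 patterns and sign(0) = +1; all function names and the tiny test patterns are hypothetical.

```python
def sign(x):
    # Assumed convention: sign(0) = +1.
    return 1 if x >= 0 else -1

def update_with_diagonal(state, memories, delta):
    # Equation (7): matrix-free update with the diagonal forced to delta.
    m, n = len(memories), len(state)
    sums = [0] * n
    for xi in memories:
        w = sum(s * x for s, x in zip(state, xi))   # inner product w^(k)(t)
        for j in range(n):
            sums[j] += w * xi[j]
    return [sign(sums[j] - (m - delta) * state[j]) for j in range(n)]

def update_explicit(state, memories, delta):
    # Reference: outer-product weights J_ij with J_jj replaced by delta.
    n = len(state)
    out = []
    for j in range(n):
        total = 0
        for i in range(n):
            jij = delta if i == j else sum(xi[i] * xi[j] for xi in memories)
            total += jij * state[i]
        out.append(sign(total))
    return out

memories = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]
```

The two functions agree because each pattern component squares to 1, so the matrix-free sum carries an implicit diagonal of M, and subtracting (M - delta) S_j(t) leaves exactly delta on the diagonal.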


Figure 1. Results of experiment 
using a dithered image with M=800 
memories. (top) Original, (middle) 
original with 30% noise, and 
(bottom) reconstructed image. 

Figure 2. Results of experiment
using a binary image with M=800
memories. (top) Original, (middle)
original with 30% noise, and
(bottom) reconstructed image.


[Figure 3 plots: (a) error after 5 iterations versus delta;
(b) error between S(t) and original versus iteration,
for delta = -M, delta = 0, and delta = M.]

Figure 3. Convergence results for M=800 memories.
(a) Error after 5 iterations for various values of
delta ranging from -M to M. (b) Error at each
iteration for delta = -M, 0, and M. Errors are the
difference between the iteration value and the
original, using equation 6.

Figure 4. Convergence results for M=100 memories.
(a) Error after 2 iterations for various values of
delta ranging from -M to M. (b) Error at each
iteration for delta = -M, 0, and M. Errors are the
difference between the iteration value and the
original, using equation 6.


Abstract


We present a novel technique for carrying out the 
automatic generation of an LR parser in a totally parallel 
fashion. The generation of an LR parser consists of con- 
structing a parse table, with one row per state (in a 
push-down automaton), and one column per terminal 
symbol. Traditionally, this is carried out row by row, in 
which case the computation of one row depends on
(potentially) all others. In contrast, our technique per- 
forms the computation by column. We show that the 
computation is totally independent for each column, mak- 
ing it ideal for parallelization. The speedup factor of the 
technique is min(N,M), where N is the number of proces- 
sors and M is the number of terminal symbols. 


1. Introduction 


LR parsing [6,15] is the technique of choice for gen- 
erating parsers for context-free languages. An LR parser 
is generated automatically from a context-free grammar, 
with a technique that consists of three phases. In the
first, the LR(0) automaton is constructed. This automaton
usually has a number of conflicts, which are of two
types: shift/reduce conflicts and reduce/reduce conflicts.
The second phase, lookahead computation, is invoked to
resolve these conflicts. This consists of computing suitable
"FOLLOW" sets for each nonterminal in the grammar.
The most popular techniques for computing lookahead
are SLR(1) (pronounced "Simple LR(1)") and
LALR(1) (pronounced "Lookahead LR(1)"). In the third
phase a transition table is built, based on the FOLLOW
sets computed in the second phase. This transition table
is traditionally known as the ACTION table, and it
encodes the appropriate move(s) (shift, reduce, accept, or
error) for each state and each terminal symbol, in the
basic form shown below.


          t_1    ...    t_n
   p_1
   ...
   p_m


Serial SLR(1) and LALR(1) algorithms [2,7,8,16,17,21,22]
fill in the "reduce" entries of the ACTION table by row,
i.e. a lookahead set is computed for each reduce move at
each state. The lookahead set is the result of unioning
certain FOLLOW sets which, in turn, are obtained by
unioning other FOLLOW sets with certain First sets.
Thus the computation for one state (i.e. one row)
depends, in general, on all the others. In contrast, our
algorithms fill in the ACTION table by column. For
each terminal symbol t (i.e. for each column), our
algorithms use relations that describe (1) how t becomes a
member of a First set, (2) how a member of a First set
becomes a member of a FOLLOW set, (3) how a member
of one FOLLOW set becomes a member of another
FOLLOW set, and (4) how a member of a FOLLOW set
triggers a reduce move at some state. The computation
of each column is completely independent from all other
columns; thus one parallel processor can be assigned to
the task of filling in each column.


Previous work [3,4,5,9,10,11,18,20,23] on parallel parsing
has focused on parallelizing the activity of parsing,
rather than parallelizing the activity of parser generation.
Thus there exist languages and compilers that
allow parallel computations, and allow the development of 
application programs that exploit the parallel architecture, 
but there are no environments suitable for the use of 
parallel processing in the development of a compiler, or 
one of its components, e.g. the parser. Giving the source 
code of an existing parser-generator to a parallelizing 
compiler will not dramatically improve the parser 
generator’s performance. At best, some of the parser- 
generator’s operations will be vectorized. Specifically, a 
popular parser-generator such as YACC [14] can be com- 
piled and run on a parallel computer, but it will use no 
more than one of the parallel processors, and thus will not 
exploit the machine’s parallel architecture. This is due to 
the algorithms used in parser-generators, which are 
inherently serial, as discussed above. 


In this paper we present a technique whereby the gen- 
eration of an SLR(1) parser can be carried out in a totally 
parallel fashion. The technique allows for the totally 
independent decomposition of the problem into as many 
subtasks as there are symbols in the terminal vocabulary 
in the language. Thus, if one has N parallel processors, 
and M terminal symbols, the speedup factor in the time 
required to generate the parser is min(N,M). The paral- 
lelization technique also applies to LALR(1) parsers [19], 
but we will omit that discussion for the sake of clarity. 


The remainder of this paper is organized as follows. 
In section 2 we present an overview of SLR(1) parsing. In 
section 3 we present a discussion of FOLLOW computa- 
tion. In section 4 we present the parallel algorithm for 
generating SLR(1) parsers, analyze the storage require- 
ments for each processor, and justify the speedup factor. 
Finally, in section 5 we present conclusions. 


2. Overview of SLR(1) Parsing 


We assume that the reader is familiar with context-free 
grammars and languages, and with shift-reduce parsing. 
We also assume some familiarity with LR parsers, in par- 
ticular the construction of the LR(0) automaton. There 
are many good texts that cover this material, such as 
[1,12,13]. We also adhere to the following notational con- 
ventions. 


A, B, C, ...        nonterminal symbols
a, b, c, ..., t     terminal symbols
..., x, y, z        terminal strings
..., X, Y, Z        grammar symbols
α, β, ..., ω        strings of grammar symbols
ε                   the empty string
A→ω                 a production in a CFG
⇒                   right-most derivation
First(α)            {t | α ⇒* tx, for some x}
p, q, r, s          states in the LR(0) automaton


A context-free grammar that generates arithmetic expressions
in a typical programming language is shown in figure 1,
along with its LR(0) automaton. In both the figure
and the automaton, "⊥" is the end-of-file marker.


Figure 1. The LR(0) parser of a context-free grammar.


The LR(0) automaton has ten states, two of which have
shift/reduce conflicts. In state 5, the parser cannot decide
between reducing using production "E→E+T" and shifting
on "*"; in state 6 the conflict is between shifting on
"*" and reducing using "E→T". The construction of this
automaton constitutes the first phase of the generation of
the parser. The second phase, namely lookahead computation,
is invoked to resolve the conflicts at states 5 and 6.
In the SLR(1) technique, this consists of computing
FOLLOW(A), where A is the left-part of the production
involved in the conflict. FOLLOW(A) is defined as
{t | S ⇒* ωAtz}. This is the set of terminal symbols that can
appear after A in a sentential form, i.e. a string that can
be derived from the start symbol S. In our example, we
should compute FOLLOW(E) for both conflicts. We will
discuss in detail the computation of FOLLOW sets, which
are obtained from the grammar, in the next section. For
now, the reader should be easily convinced that
FOLLOW(E) = {+, ⊥}, since these are the only symbols
that may legally appear after E in a sentential form. The
FOLLOW sets are shown in the boolean table in figure 2,
in which "•" in entry [A,t] indicates that t∈FOLLOW(A).
We shall see shortly that explicitly representing the
FOLLOW sets in this manner is unnecessary. Instead, one
parallel processor will compute each column.


FOLLOW(E)={+,⊥}
FOLLOW(T)={+,*,⊥}
FOLLOW(F)={+,*,⊥}


Figure 2. A FOLLOW Table. 


Having computed the necessary FOLLOW sets, we 
proceed to the third and last phase, the generation of the 
parse table. As mentioned before, this is traditionally 
done by row, i.e. state by state. Each state is examined, 
and information regarding terminal transitions leading 
from it, as well as reductions available there, are encoded 
into the ACTION table. The procedure for filling the 
entries of the ACTION table is as follows. 


for each state q do
   for each transition q --t--> p do
      Add "S/p" to ACTION[q,t]
         {A "shift to p" move};
   for each reduction on A→ω do
      for each t∈FOLLOW(A) do
         Add "R/A→ω" to ACTION[q,t]
            {A "Reduce using A→ω" move};


After these entries have been added, blank entries are
filled as "error". The table should have no multiple
entries. If it does, the grammar is not SLR(1). For our
example, the ACTION¹ table is shown in figure 3.


State/Symbol   +          *          n        ⊥

  1            error      error      S/9      error
  2            S/4        error      error    error
  3            Accept     Accept     Accept   Accept
  4            error      S/7        S/9      error
  5            R/E→E+T    S/7        error    R/E→E+T
  6            R/E→T      S/7        error    R/E→T
  7            error      error      S/9      error
  8            R/T→T*F    R/T→T*F    error    R/T→T*F
  9            R/F→n      R/F→n      error    R/F→n
 10            R/T→F      R/T→F      error    R/T→F


Figure 3. An SLR(1) ACTION Table. 


Historically, there have been good reasons for filling in
the ACTION table by row rather than by column. In
practice, only a small portion of the states in the LR(0)
automaton have conflicts; in our case there were two such
states out of a total of ten. This proportion (about 20%)
is in fact quite common. Thus, the table is traditionally
filled by row so that states that have no conflicts (which
are in the majority) incur no lookahead computation.


¹ We have neglected the nonterminal transitions in the LR(0)
automaton. These are encoded into a separate table called the "GOTO"
table, which is of no interest to us because the nonterminal transitions
are entirely unaffected by lookahead.


It is well known that some FOLLOW sets are subsets
of others. In our example, FOLLOW(E) ⊆ FOLLOW(T)
⊆ FOLLOW(F). Thus FOLLOW(E) must be computed
before concluding the computation of FOLLOW(T), and
before computing FOLLOW(F). In principle, to obtain
the FOLLOW set of some nonterminal A, one may have
to first compute the FOLLOW sets of all the other
nonterminals. Such is the case here for F. Because of these
dependencies, the FOLLOW computation has not been
deemed suitable for parallelization. The same occurs for
the generation of the ACTION table. Allocating two
separate processors for rows p and q of the ACTION
table yields no significant advantage, since there may be
reductions at p and q, on productions whose left-parts are,
say, A_1 and A_2, whose FOLLOW sets are not independent.


Our approach to the problem is simply to consider fil- 
ling in the ACTION table by column, rather than by row. 
Doing so produces completely independent computation 
for each column. We must first discuss in detail the com- 
putation of FOLLOW, as shown in the next section. 


3. Computation of FOLLOW 


We begin by reversing the order of the “for” loops in 
the procedure shown earlier, so that the iteration on states 
is nested within the iteration on terminal symbols, rather 
than the other way around. The resulting equivalent pro- 
cedure is as follows. 


for each symbol t do
   for each state q do
      if q --t--> p is defined then
         Add "S/p" to ACTION[q,t];
      for each reduction on A→ω do
         if t∈FOLLOW(A) then
            Add "R/A→ω" to ACTION[q,t];
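
The equivalence of the two loop orders can be checked on a small fragment. The sketch below is an assumption-laden illustration: the dictionaries hold only a few states of the example automaton, "$" stands in for the end-of-file marker, and the names `fill_by_row` and `fill_by_column` are hypothetical.

```python
from collections import defaultdict

# Fragment of the example: terminal transitions, reductions per state, FOLLOW sets.
transitions = {(1, "n"): 9, (5, "*"): 7, (6, "*"): 7}
reductions = {5: [("E", "E+T")], 6: [("E", "T")], 9: [("F", "n")]}
follow = {"E": {"+", "$"}, "T": {"+", "*", "$"}, "F": {"+", "*", "$"}}
states = [1, 5, 6, 9]
terminals = ["+", "*", "n", "$"]

def fill_by_row():
    action = defaultdict(set)
    for q in states:
        for (q2, t), p in transitions.items():
            if q2 == q:
                action[q, t].add(f"S/{p}")
        for lhs, rhs in reductions.get(q, []):
            for t in follow[lhs]:              # enumerates the whole FOLLOW set
                action[q, t].add(f"R/{lhs}->{rhs}")
    return dict(action)

def fill_by_column():
    action = defaultdict(set)
    for t in terminals:
        for q in states:
            if (q, t) in transitions:
                action[q, t].add(f"S/{transitions[q, t]}")
            for lhs, rhs in reductions.get(q, []):
                if t in follow[lhs]:           # needs only a membership test
                    action[q, t].add(f"R/{lhs}->{rhs}")
    return dict(action)
```

Both functions produce the same table; the column version only ever asks whether one terminal is in a FOLLOW set, which is the property the parallel algorithm relies on.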


Our intent is to allocate a separate parallel processor to
each individual terminal symbol. The processor will
fill in the entire column for that symbol. This can be
accomplished because of a critical observation: the
reformulated procedure requires knowledge of whether or
not t∈FOLLOW(A) (a boolean value), for which the
entire set FOLLOW(A) is not necessary. This is in contrast
with the earlier procedure, in which the entire set is
required because a for loop is used to enumerate the
elements of FOLLOW(A). Thus the core of the problem of
parallelizing the generation of the parser lies in computing
whether or not t is an element of FOLLOW(A), without
explicitly computing FOLLOW(A) itself. We now show
how this can in fact be achieved. We begin by breaking
down the FOLLOW set into "Direct" and "Indirect"
FOLLOW sets.

FOLLOW(A)  = IFOLLOW(A) ∪ DFOLLOW(A),
IFOLLOW(A) = ∪ {FOLLOW(B) | B→αAγ, γ⇒*ε},
DFOLLOW(A) = ∪ {First(X) | B→αAγXδ, γ⇒*ε},
First(A)   = ∪ {First(X) | A→γXδ, γ⇒*ε},
First(t)   = {t}.


A symbol may follow nonterminal A directly, by appearing
at the beginning of a phrase (γXδ) that immediately
follows A in the right-part of some production. On the
other hand, a symbol may follow nonterminal A
indirectly, by appearing in FOLLOW(B), where
FOLLOW(B) ⊆ FOLLOW(A). The recursive nature of
these equations suggests that terminal symbols propagate
from a First set to other First set(s), from there to
FOLLOW set(s), and from FOLLOW set(s) to other
FOLLOW set(s). The following relations capture this
propagation.


Definition: X ff A if there is a production A→γXδ such
that γ⇒*ε.


"ff" (pronounced "first-to-first") describes how one First
set contributes symbols to another First set. Clearly,
t∈First(X) iff t ff* X.


Definition: X fF A if there is a production B→αAγXδ
such that γ⇒*ε.


"fF" (pronounced "first-to-follow") describes how a First
set contributes symbols to a direct FOLLOW set, and
thereby to a FOLLOW set. Clearly, t∈DFOLLOW(A) iff
t∈First(X) and X fF A, for some X. Furthermore, the
composition of ff* and fF describes all symbols in
DFOLLOW(A), i.e. t∈DFOLLOW(A) iff t (ff*∘fF) A.


Definition: B FF A if there is a production B→αAγ
such that γ⇒*ε.


"FF" (pronounced "follow-to-follow") describes how one
FOLLOW set contributes symbols to an indirect
FOLLOW set, and thereby to a FOLLOW set. Clearly,
FOLLOW(B) ⊆ FOLLOW(A) if B FF* A. Thus, the
necessary and sufficient conditions for t to be in
FOLLOW(A) are (1) t (ff*∘fF) A, i.e. t follows A directly,
or (2) t (ff*∘fF∘FF⁺) A, i.e. t follows A indirectly. Factoring
out relation (ff*∘fF) from these two cases, we conclude
that t∈FOLLOW(A) iff t (ff*∘fF∘FF*) A.
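
The three relations can be derived mechanically from a grammar. The following sketch does so for the expression grammar of the running example and then evaluates t (ff*∘fF∘FF*) A by simple reachability. The grammar encoding, the use of "$" for the end-of-file marker, the assumed start production S → E $, and all function names are illustrative assumptions.

```python
grammar = {
    "S": [["E", "$"]],
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["n"]],
}

def nullable_set(g):
    # Nonterminals that derive the empty string (none in this grammar).
    nullable, changed = set(), True
    while changed:
        changed = False
        for a, rhss in g.items():
            if a not in nullable and any(all(x in nullable for x in r) for r in rhss):
                nullable.add(a)
                changed = True
    return nullable

def relations(g):
    nul = nullable_set(g)
    ff, fF, FF = set(), set(), set()
    for b, rhss in g.items():
        for rhs in rhss:
            for i, x in enumerate(rhs):
                if all(y in nul for y in rhs[:i]):
                    ff.add((x, b))                     # x ff b
                if x in g:                             # x is a nonterminal
                    for j in range(i + 1, len(rhs)):
                        if all(y in nul for y in rhs[i + 1:j]):
                            fF.add((rhs[j], x))        # rhs[j] fF x
                    if all(y in nul for y in rhs[i + 1:]):
                        FF.add((b, x))                 # b FF x
    return ff, fF, FF

def reach(start, rel):
    # Reflexive, transitive closure of rel applied to a set of start nodes.
    seen, stack = set(start), list(start)
    while stack:
        x = stack.pop()
        for a, b in rel:
            if a == x and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def follow_members(t, g):
    ff, fF, FF = relations(g)
    firsts = reach({t}, ff)                            # all X with t ff* X
    direct = {a for (x, a) in fF if x in firsts}       # t (ff* o fF) A
    return reach(direct, FF)                           # t (ff* o fF o FF*) A
```

For this grammar, `follow_members("*", grammar)` yields T and F but not E, in agreement with the FOLLOW sets of figure 2.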


4. Parallel Generation of SLR(1) Parsers 


Having characterized SLR(1) lookahead symbols, a
very simple algorithm for filling the entries of the
ACTION table can now be formulated. The algorithm
must compute the reflexive, transitive closure of both
relations "ff" and "FF". During the computation of this
closure, it is desirable to avoid repeatedly visiting any
nonterminals. Thus we require two bit-valued structures
"ff_Visited" and "FF_Visited". The algorithm is
presented below.


Algorithm Compute_SLR_Action_Table:
Input: LR(0) automaton, ff, fF, FF;
Output: ACTION table;

var
   ff_Visited: a bit vector indexed by symbols;
   FF_Visited: a bit vector indexed by nonterminals;

procedure Follow_to_Follow(A):
begin
   if FF_Visited[A] then return;
   set FF_Visited[A];
   for each B such that A FF B do
      Follow_to_Follow(B);
end;

procedure First_to_First(X):
begin
   if ff_Visited[X] then return;
   set ff_Visited[X];
   for each A such that X fF A do
      Follow_to_Follow(A);
   for each Y such that X ff Y do
      First_to_First(Y);
end;

begin
   for each terminal t do
   begin
      clear ff_Visited[X], for each symbol X;
      clear FF_Visited[A], for each nonterminal A;

      First_to_First(t);

      for each state q do
      begin
         if p=Go(q,t) is defined then
            Add "S/p" to ACTION[q,t];
         for each (q,A→ω) do
            if FF_Visited[A] then
               Add "R/A→ω" to ACTION[q,t];
      end;
   end;
end;


In this algorithm, procedure "Add" announces that the
grammar is not SLR(1) if another move already exists in
ACTION[q,t]. For each terminal symbol t, the "Visited"
vectors are cleared, and First_to_First(t) is called. This
results in setting ff_Visited[X], and calling
Follow_to_Follow(A), for all X such that t ff* X, and for
all A such that X fF A. The call to Follow_to_Follow(A),
in turn, sets FF_Visited[B], for all B such that A FF* B.
Thus the net effect of calling First_to_First(t) is setting
FF_Visited[A], for all A such that t (ff*∘fF∘FF*) A. After
this call, FF_Visited is "true" for those nonterminals
whose FOLLOW sets contain t. After the call, each state
in the LR(0) machine is examined, and both shift and
reduce entries are made in the ACTION table, as
required.
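
A direct transcription of the per-terminal work into Python may help to see the control flow. This is only a sketch: the relations are written out by hand for the example grammar, the automaton data is a small fragment, "$" stands for the end-of-file marker, and `action_column` and its helpers are hypothetical names.

```python
# Relations as successor lists: ff[X] lists all Y with X ff Y, and so on.
ff = {"n": ["F"], "F": ["T"], "T": ["T", "E"], "E": ["E", "S"]}
fF = {"+": ["E"], "*": ["T"], "$": ["E"]}
FF = {"E": ["T"], "T": ["F"]}

goto = {(1, "n"): 9, (5, "*"): 7, (6, "*"): 7}          # fragment of Go(q,t)
reductions = {5: [("E", "E+T")], 6: [("E", "T")], 9: [("F", "n")]}
states = [1, 5, 6, 9]

def action_column(t):
    in_first, in_follow = set(), set()      # the two "Visited" vectors

    def follow_to_follow(a):
        if a in in_follow:
            return
        in_follow.add(a)
        for b in FF.get(a, []):
            follow_to_follow(b)

    def first_to_first(x):
        if x in in_first:
            return
        in_first.add(x)
        for a in fF.get(x, []):
            follow_to_follow(a)
        for y in ff.get(x, []):
            first_to_first(y)

    first_to_first(t)

    column = {}

    def add(q, move):
        # A second move in the same entry means the grammar is not SLR(1).
        if q in column:
            raise ValueError(f"conflict in state {q} on symbol {t!r}")
        column[q] = move

    for q in states:
        if (q, t) in goto:
            add(q, f"S/{goto[q, t]}")
        for lhs, rhs in reductions.get(q, []):
            if lhs in in_follow:
                add(q, f"R/{lhs}->{rhs}")
    return column
```

Because the two sets are created afresh on every call, nothing is shared between calls for different terminals, which mirrors the independence argument made in the text.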


It is critically important to note that the computation
performed for each symbol t is completely independent of
all the other symbols. Thus one parallel processor can be
assigned to each symbol t. The parallel processors may
share the memory in which the three relations ff, fF and
FF are stored. Individually, each processor must 1) run
its own copy of the algorithm, 2) have its own column of
the ACTION table to fill in, and 3) have its own
"Visited" vectors. In fact, the "Visited" vectors are
more aptly named "t_is_in_First" and "t_is_in_Follow".
The algorithm executed by each parallel processor is as
follows.


Algorithm Compute_SLR_Action_Table_t:
Input: LR(0) automaton, ff, fF, FF, t;
Output: ACTION: a vector indexed by states;

var
   t_is_in_First: a bit vector indexed by symbols;
   t_is_in_Follow: a bit vector indexed by nonterminals;

procedure Follow_to_Follow(A):
begin
   if t_is_in_Follow[A] then return;
   set t_is_in_Follow[A];
   for each B such that A FF B do
      Follow_to_Follow(B);
end;

procedure First_to_First(X):
begin
   if t_is_in_First[X] then return;
   set t_is_in_First[X];
   for each A such that X fF A do
      Follow_to_Follow(A);
   for each Y such that X ff Y do
      First_to_First(Y);
end;

begin
   clear t_is_in_First[X], for each symbol X;
   clear t_is_in_Follow[A], for each nonterminal A;

   First_to_First(t);

   for each state q do
   begin
      if p=Go(q,t) is defined then
         Add "S/p" to ACTION[q];
      for each (q,A→ω) do
         if t_is_in_Follow[A] then
            Add "R/A→ω" to ACTION[q];
   end;
end;


The storage requirements of each processor are quite
reasonable: a bit vector whose length is the number of symbols
(t_is_in_First), another bit vector whose length is the number of
nonterminals (t_is_in_Follow), and a vector whose length is the
number of states (the ACTION table column for t). In
the last of these, each element usually requires an integer
to represent (in packed form) the corresponding shift or
reduce move.


One processor must be reserved to allocate tasks to 
the others, but the task allocation scheme is trivial. Any 
processor can be assigned to any symbol, as long as the 
ACTION table is re-constructed in a consistent manner. 
If there are N processors and M terminal symbols, the 
computation is sped up by a factor of min(N,M). The 
typical grammar that describes the phrase-structure of a 
programming language contains several hundred terminal 
symbols; hence our approach can be profitably implemented
on a wide range of parallel computers.
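
The task allocation described above can be sketched with a worker pool that hands one terminal symbol to each free worker. The example below is illustrative only: `fill_column` is a simplified stand-in that uses precomputed FOLLOW sets, the automaton data is a fragment, "$" stands for the end-of-file marker, and threads in standard CPython show the structure of the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

goto = {(1, "n"): 9, (5, "*"): 7, (6, "*"): 7}
reductions = {5: [("E", "E+T")], 6: [("E", "T")], 9: [("F", "n")]}
follow = {"E": {"+", "$"}, "T": {"+", "*", "$"}, "F": {"+", "*", "$"}}
states = [1, 5, 6, 9]
terminals = ["+", "*", "n", "$"]

def fill_column(t):
    # Stand-in for the per-terminal algorithm: builds one ACTION column.
    column = {}
    for q in states:
        if (q, t) in goto:
            column.setdefault(q, []).append(f"S/{goto[q, t]}")
        for lhs, rhs in reductions.get(q, []):
            if t in follow[lhs]:
                column.setdefault(q, []).append(f"R/{lhs}->{rhs}")
    return column

def build_action_table(symbols, worker, n_workers):
    # Any worker may take any symbol; map() returns columns in input order,
    # so the table is reassembled consistently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        columns = list(pool.map(worker, symbols))
    return dict(zip(symbols, columns))

def ideal_speedup(n_processors, m_terminals):
    # The speedup factor stated in the text.
    return min(n_processors, m_terminals)

table = build_action_table(terminals, fill_column, n_workers=2)
```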


5. Conclusions 


We have presented a technique for carrying out the 
automatic generation of an SLR(1) parser in a totally 
parallel fashion. The technique consists of computing (in 
parallel) one column of the parse table independently for 
each terminal symbol. We have shown that the speedup
factor is min(N,M), where N is the number of processors,
and M is the number of terminal symbols. We have also 
shown that the storage requirements are reasonable for 
each processor. 


SLR(1) is one of the best known LR parsing methods. 
It is well known that LALR(1) is more powerful and more 
popular than SLR(1). Although not presented here for 
the sake of simplicity, the parallelization technique shown 
in this paper also applies to the LALR(1) method, as 
shown in detail in [19]. 


Abstract


The information processing task of abduction is to infer 
a hypothesis that best explains a set of data. A typical subtask 
of this is to synthesize a composite hypothesis that best explains 
the data set from elementary hypotheses that can explain vari- 
ous portions of the data. In this paper, we present a computa- 
tional model for concurrent synthesis of composite explanatory 
hypotheses that can be realized on a distributed memory, mes- 
sage passing, parallel machine. In this model, a process is as- 
sociated with each datum to be explained as well as with each 
elementary hypothesis that can explain some portion of the data, 
and the control of processing alternates between the data and hy- 
potheses’ processes. In each cycle of processing, the data and 
the hypotheses’ processes view the problem solving from their 
perspectives, and add to the growing composite explanatory hy- 
pothesis until a best explanation is synthesized. We analyze the 
time complexity of the concurrent algorithms, and discuss the 
architectural implications of the model. 


Abductive Inference 


Abduction is the very general information processing task of in- 
ferring a hypothesis that best explains a set of data [2,7]. Abduc- 
tion occurs, for instance, in diagnostic problem solving, where 
the data is in the form of manifestations (or symptoms), and the 
explanatory hypotheses are about component malfunctions (or 
diseases) [9,10]. A typical subtask of abduction is classification 
of the observed data onto stored elementary hypotheses. In sim- 
ple abductive problems, e.g., diagnosis under the single fault 
assumption, the classification subtask often yields elementary 
hypotheses that can individually explain the entire data. For 
such problems, the elementary hypothesis that most plausibly 
explains the data represents the best explanation. In general, 
however, an elementary hypothesis that can account for the en- 
tire data may not be available. Instead, a composite hypothesis 
has to be synthesized from elementary hypotheses that can ex- 
plain various portions of the data. Synthesis of composite ex- 
planatory hypotheses is thus another typical subtask of abduc- 
tion. 

Synthesizing a composite hypothesis that best explains a 
set of data can be computationally very expensive, especially in 
the presence of certain types of interactions between the elemen- 
tary hypotheses [1]. This suggests exploitation of concurrency 
in performing the synthesis subtask [8]. We have elsewhere re- 
ported [3] on a computational model for concurrent synthesis 
of composite explanatory hypotheses based on a shared memory,
multiprocessor architecture. Our work on the "blackboard"
model led us to think that a distributed memory, message passing 
parallel architecture may provide a more modular organization 
of knowledge and processing for the synthesis subtask. Thus 
in this paper we present a computational model for distributed 
synthesis of composite explanatory hypotheses. 


Characterization of the Synthesis Task 


Let $D = \{d_i \mid i = 1,\ldots,n\}$ be a finite set of n observed data.
Let $H = \{h_j \mid j = 1,\ldots,m\}$ be a finite set of m elementary
explanatory hypotheses. Let e be a map from subsets of H to
subsets of D; $e : 2^H \to 2^D$. We interpret $e(H_i) = D_i$, where
$H_i \subseteq H$ and $D_i \subseteq D$, as the explanatory coverage of $H_i$, i.e.
$H_i$ can explain all members of $D_i$. Let $V = \{v_k \mid k = 1,\ldots,l\}$
be a finite set of l discrete values. Let b be a map from H to V,
i.e., $b : H \to V$. We interpret $b(h_j)$ as the prima facie belief
in hypothesis $h_j$. We may characterize the task of synthesizing
a composite explanatory hypothesis by its input: D, H, e and b,
and its output: C, where C is a subset of H that best explains
D. The synthesis task is linear [1] if


$$\forall h_i, h_j \in H,\quad e(h_i) \cup e(h_j) = e(\{h_i, h_j\})$$
and monotonic [1] if
$$\forall h_i, h_j \in H,\quad e(h_i) \cup e(h_j) \subseteq e(\{h_i, h_j\})$$


In this paper we consider only the linear version of the general 
synthesis task. We note that linearity of the task entails mono- 
tonicity. Thus we assume that the elementary hypotheses are 
non-interacting and offer explanatory alternatives where their 
coverages overlap. 
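
The two conditions can be checked directly for a small coverage map. In the sketch below, the coverage table `covers`, the two example maps `e_union` and `e_synergy`, and the function names are all hypothetical; they are chosen only to show one linear map and one map that is monotonic but not linear.

```python
from itertools import product

def is_linear(hypotheses, e):
    # e(h_i) U e(h_j) == e({h_i, h_j}) for every pair.
    return all(e({a}) | e({b}) == e({a, b})
               for a, b in product(hypotheses, repeat=2))

def is_monotonic(hypotheses, e):
    # e(h_i) U e(h_j) is a subset of e({h_i, h_j}) for every pair.
    return all(e({a}) | e({b}) <= e({a, b})
               for a, b in product(hypotheses, repeat=2))

covers = {"h1": {"d1", "d2"}, "h2": {"d2", "d3"}}

def e_union(hs):
    # Non-interacting hypotheses: coverage is the plain union.
    out = set()
    for h in hs:
        out |= covers[h]
    return out

def e_synergy(hs):
    # Interacting hypotheses: together, h1 and h2 also explain d4.
    out = e_union(hs)
    if {"h1", "h2"} <= set(hs):
        out = out | {"d4"}
    return out
```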


Characterization of the “Best” Explanation 


This characterization of the synthesis task is incomplete since 
we have not yet specified what is meant by the best explanation. 
A general operational characterization of the “best” explanation 
is based on the following three criteria: 

Maximal explanatory coverage of data: A hypothesis $C_1$ is a better
explanation of D than another hypothesis $C_2$ if $e(C_2) \subset e(C_1)$
(strictly, if $e(C_2) \cap D \subset e(C_1) \cap D$). Ideally, the
synthesized C should provide complete explanatory coverage of D.

Maximal belief in hypothesis: If $e(C_1) = e(C_2)$, then $C_1$ is a better
explanation of D than $C_2$ if $b(C_1) > b(C_2)$, which denotes that
$\forall d \in D$, $\forall h_2 \in C_2$ such that $d \in e(h_2)$, $\exists h_1 \in C_1$ such that
$d \in e(h_1) \wedge b(h_1) \geq b(h_2)$. This criterion specifies that the
elementary hypotheses in C should be locally optimal in terms of
their belief values.

Minimal hypothesis: If $e(C_1) = e(C_2)$ and $b(C_1) = b(C_2)$, then
$C_1$ is a better explanation of D than $C_2$ if $C_1 \subset C_2$. This global
optimization condition specifies that C should be parsimonious.
We note the precedence relation between the three cri- 
teria according to which maximal coverage of the data has the 
highest precedence and parsimony of the composite hypothesis 
has the lowest. We note also that depending on the maps e and 
b, the synthesis task may be underconstrained, in which case the 
synthesized explanation would only be a “best” explanation. 
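
One way to read the three criteria and their precedence is as a comparison function on composite hypotheses. The sketch below is an interpretation under stated assumptions: coverage is taken to be linear (a union of elementary coverages), belief values are integers, two composites are treated as equal in belief when each dominates the other, and the names `better`, `belief_dominates`, and the sample data are hypothetical.

```python
def coverage(c, covers):
    out = set()
    for h in c:
        out |= covers[h]
    return out

def belief_dominates(c1, c2, covers, belief, data):
    # For every datum d and every h2 in c2 explaining d, some h1 in c1
    # explains d with belief at least that of h2.
    for d in data:
        for h2 in c2:
            if d in covers[h2] and not any(
                d in covers[h1] and belief[h1] >= belief[h2] for h1 in c1
            ):
                return False
    return True

def better(c1, c2, covers, belief, data):
    e1 = coverage(c1, covers) & data
    e2 = coverage(c2, covers) & data
    if e1 != e2:
        return e2 < e1                       # criterion 1: coverage
    up = belief_dominates(c1, c2, covers, belief, data)
    down = belief_dominates(c2, c1, covers, belief, data)
    if up != down:
        return up                            # criterion 2: belief
    if not up:
        return False                         # incomparable in belief
    return set(c1) < set(c2)                 # criterion 3: parsimony

data = {"d1", "d2", "d3"}
covers = {"h1": {"d1", "d2"}, "h2": {"d2", "d3"}, "h3": {"d3"}}
belief = {"h1": 3, "h2": 2, "h3": 1}
```

With this data, {h1, h2} beats {h1} on coverage, beats {h1, h3} on belief, and beats {h1, h2, h3} on parsimony.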


A Concurrent Model of the Task 


Decomposition of the Synthesis Task 


The task of synthesizing a C that "best" explains D may be
decomposed into two phases: generation and testing. In the
generation phase, a C that achieves the goal of complete
explanatory coverage of the data and satisfies the local constraints
on the choice of elementary hypotheses may be generated. In
the testing phase, the generated C may be tested for the global
constraint of parsimony and, if possible, further optimized. The
generation phase itself is further decomposable into three steps
corresponding to the three types of hypotheses that need to be
included in C. In the first step, the $h_j \in H$ that are necessary
for explaining $d_i \in D$ may be found by a specialized form
of means-ends mechanism which views explaining each $d_i$ as
a subgoal of the synthesis of C. A hypothesis $h_1$ is necessary
for explaining some datum d if there exists no other $h_2$ that can
explain d. We denote this set of hypotheses as $C_{Essential}$.
In the next step, the $h_j \in H$ that are the best explanations for
some $d_i \in D - e(C_{Essential})$ are found by the same means-ends
mechanism, where a hypothesis $h_1$ is clearly the best explanation
for some datum d if its belief value $b(h_1)$ is above some
high threshold $\theta$, and there is no $h_2$ that can explain d and
has a belief value above $\theta$. We denote this set of hypotheses as
$C_{Firm}$. In the final step, the $h_j \in H$ that are needed for explaining
$d_i \in D - e(C_{Essential}) - e(C_{Firm})$ are found; we denote this
subset of components as $C_{In}$. The composite hypothesis generated
is given by $C = C_{Essential} \cup C_{Firm} \cup C_{In}$.

The hypotheses in $C_{Essential}$, $C_{Firm}$ and $C_{In}$ differ from
each other in the firmness with which they are included in C.
The In components are only weakly included in C, and should
be tested for explanatory superfluousness, where a hypothesis
is explanatorily superfluous if removing it from C does not
reduce e(C). The Essential components are strongly included in
C, cannot be removed from C without reducing e(C), and need
not be tested for explanatory superfluousness. Similarly, by the
manner in which $C_{Firm}$ was generated, the Firm hypotheses are
firmly included in C, cannot be removed without reducing b(C)
(relative to other composite hypotheses), and need not be tested
for explanatory superfluousness.


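The three generation steps can be sketched serially as follows. Several details here are assumptions rather than part of the text: belief values are integers, the threshold is passed as `theta`, and in the third step the highest-belief explainer of each remaining datum is chosen, a rule the text does not specify. The function name `generate` and the sample data are hypothetical.

```python
def explained(c, covers):
    out = set()
    for h in c:
        out |= covers[h]
    return out

def generate(data, covers, belief, theta):
    explainers = {d: [h for h in covers if d in covers[h]] for d in data}

    # Step 1: hypotheses that are the only explanation of some datum.
    essential = {hs[0] for hs in explainers.values() if len(hs) == 1}

    # Step 2: for data still unexplained, a hypothesis that is the single
    # explainer with belief above the threshold.
    rest = set(data) - explained(essential, covers)
    firm = set()
    for d in rest:
        strong = [h for h in explainers[d] if belief[h] > theta]
        if len(strong) == 1:
            firm.add(strong[0])

    # Step 3 (assumed rule): cover what remains with the most believed explainer.
    rest -= explained(firm, covers)
    weak = set()
    for d in sorted(rest):
        if d in explained(weak, covers) or not explainers[d]:
            continue
        weak.add(max(explainers[d], key=lambda h: belief[h]))
    return essential, firm, weak

data = {"d1", "d2", "d3", "d4"}
covers = {"h1": {"d1", "d2"}, "h2": {"d2", "d3"}, "h3": {"d3"},
          "h4": {"d4"}, "h5": {"d4"}}
belief = {"h1": 3, "h2": 3, "h3": 1, "h4": 2, "h5": 1}
essential, firm, weak = generate(data, covers, belief, theta=2)
```

Here h1 is essential because nothing else explains d1, h2 is firm for d3, and h4 is only weakly included for d4, so it is the one component that would later be tested for superfluousness.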
Concurrency in the Synthesis Mechanism 


It is clear that certain aspects of synthesizing C are inherently
sequential, e.g., the testing of the composite hypothesis for
parsimony should serially follow the generation phase, and even
in the generation phase, the sequential order of the three steps
should be maintained. However, there also is significant
concurrency in the process. There are really two types of questions
that are raised during the synthesis of C. The first type is from
the perspective of each datum $d_i \in D$ and is of the form "which
hypothesis $h_j \in H$ can best explain me?". This question can
be asked and answered for each datum concurrently with others.
The second type of question is from the perspective of each
hypothesis $h_j \in H$ and is of the form "which elements of D can
I be used to explain?". Again, this question can be asked and
answered for each hypothesis concurrently with others.

This suggests that a process be assigned to each d_i ∈ D as well
as to each h_j ∈ H. Let P = {p_i | i = 1,...,n} be a set of n
processes, one for each d_i ∈ D. Each process p_i ∈ P represents
the perspective of the corresponding datum d_i ∈ D in the
synthesis of C. The P processes use identical algorithms and can
execute concurrently. Similarly, let Q = {q_j | j = 1,...,m} be a
set of m processes, one for each h_j ∈ H. Each process q_j ∈ Q
represents the perspective of the corresponding hypothesis
h_j ∈ H. Again, the Q processes use identical algorithms and can
execute concurrently.


A Distributed Mechanism for the Task 


In distributed synthesis of composite explanatory hypotheses,
each process has access only to its own local memory.
Communication between the processes occurs by message passing
only; there is no global updating of shared variables.
Synchronization between the processes can be achieved by adopting
the framework of Communicating Sequential Processes (CSP) [6]. In
CSP, communication between processes occurs when one process
names another as the destination and the second process names the
first as the source. Synchronization is achieved by delaying the
execution of an input or output command until the other process
is ready with the corresponding output or input.
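
The rendezvous behaviour described above can be imitated with threads. The following is a minimal Python sketch, not the CSP notation of [6]: the `Channel` class and its `send` and `receive` methods are hypothetical names, and one channel is assumed per (source, destination) pair.

```python
import queue
import threading


class Channel:
    """Unbuffered channel: send() returns only after a matching receive()."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, msg):
        self._q.put(msg)   # offer the message to the destination
        self._q.join()     # wait until the destination has taken it

    def receive(self):
        msg = self._q.get()   # wait until the source is ready to output
        self._q.task_done()   # release the waiting source
        return msg


# Hypothetical usage: a p process sends Essential! to a q process.
chan = Channel()
received = []


def q_process():
    received.append(chan.receive())


t_p = threading.Thread(target=chan.send, args=("Essential!",))
t_q = threading.Thread(target=q_process)
t_p.start()
t_q.start()
t_p.join()
t_q.join()
print(received)  # ['Essential!']
```

Whichever thread reaches the channel first is delayed until its partner arrives, which is the synchronization property the model relies on.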

The messages between the P and Q processes are semantically
encoded and the response of a process to a message depends on the
semantics of the message it receives. For example, if process
p_i ∈ P corresponding to datum d_i determines that some h_j ∈ H
is necessary for explaining d_i, then it sends an Essential!
message to process q_j ∈ Q representing h_j. On receiving this
message, q_j sends an Explained! message to the processes
p_i ∈ P corresponding to the d_i ∈ D that h_j can explain. In
addition to messages of this type, the processes are allowed to
send (and receive) Null messages.

The control of processing alternates between the P processes and
the Q processes. In each cycle of processing, when the P
processes are executing the Q processes are idle; when the P
processes have finished executing some step, they communicate
their results to the Q processes, and the Q processes can start
executing. Similarly, when the Q processes are executing the P
processes are idle; again, when the Q processes have finished
executing some step, they communicate their results to the P
processes, and the P processes can start executing. Thus in each
cycle of processing, the P and the Q processes view the synthesis
task from different perspectives and add to the growing composite
explanatory hypothesis. After three such cycles that correspond
to computing C_Essential, C_Firm and C_In respectively, a
composite hypothesis C that achieves the goal of complete
explanatory coverage and satisfies the local constraints on the
choice of elementary hypotheses is generated. The generated C is
now tested for the global constraint of parsimony in a similar
manner.

Distributed Hypothesis Generation 


We now present concurrent algorithms for distributed generation
of C; detailed concurrent algorithms written in CSP are given in
[4]. Since the n processes in P use identical algorithms, and the
m processes in Q also use identical algorithms, it suffices to
describe the processing from the perspectives of a datum d ∈ D
represented by a process p ∈ P and an elementary hypothesis
h ∈ H represented by a process q ∈ Q. The process q initially
has information specifying the hypothesis h that it represents,
the explanatory coverage e(h) of the hypothesis, the belief value
b(h) of the hypothesis, and the data D to be explained.
Similarly, the process p initially has information specifying the
datum d that it represents and the cardinality of H.

In the first cycle of processing, the Essential components are
identified. The processing begins when process q sends an
Essential? message to processes in P corresponding to data
d_i ∈ e(h), and Null messages to other processes in P. Process p
receives messages from processes in Q. Now from the perspective
of d, one of three things can happen:

(i) p receives no Essential? messages. Then d is Unexplainable,
and p does nothing;

(ii) p receives exactly one Essential? message. Then the
hypothesis corresponding to the process in Q from whom p received
the message is necessary for inclusion in C, and p sends an
Essential! message to that process;

(iii) p receives more than one Essential? message. Then p sends
a Null message to the processes in Q from whom p had received
the Essential? messages.

Process q receives messages from processes in P corresponding to
d_i ∈ e(h). In the second cycle of processing, the Firm
hypotheses are identified. At the end of the first cycle, from
the perspective of h one of two things can happen:

(i) q receives at least one Essential! message. Then q sends an
Explained! message to processes in P corresponding to
d_i ∈ e(h);

(ii) q receives only Null messages. Then q sends Firm? messages,
along with the belief value b(h) of the hypothesis it represents,
to processes in P corresponding to d_i ∈ e(h).

Process p receives messages from the processes in Q corresponding
to the hypotheses that can explain d. Now from the perspective of
d, one of three things can happen:

(i) p receives at least one Explained! message. Then p does
nothing;

(ii) p receives only Firm? messages. Then if there is some h_j
among the hypotheses that can explain d such that b(h_j) is above
some high threshold θ, and there is no other h that can explain d
and has a belief value above θ, p sends a Firm! message to the
q_j process representing the h_j hypothesis, and Null messages to
processes in Q corresponding to other hypotheses that can explain
the d;

(iii) p receives only Firm? messages, and there is no hypothesis
that is clearly the best explanation for d. Then p sends Null
messages to processes in Q corresponding to hypotheses that can
explain the d.

Process q receives messages from processes in P corresponding to
d_i ∈ e(h). In the third cycle of processing, the In hypotheses
are identified. At the end of the second cycle, from the
perspective of h one of two things can happen:

(i) q receives at least one Firm! message. Then q sends an
Explained! message to processes in P corresponding to
d_i ∈ e(h);

(ii) q receives only Null messages. Then q sends In? messages to
processes in P corresponding to d_i ∈ e(h).

Process p receives messages from the processes in Q corresponding
to the hypotheses that can explain d. Now from the perspective of
d, one of two things can happen:

(i) p receives at least one Explained! message. Then p does
nothing;

(ii) p receives only In? messages. Then p selects the h with the
highest belief value among the hypotheses that can explain d. If
two (or more) hypotheses can explain d equally well, then p
breaks the tie between them by selecting the hypothesis with the
largest explanatory coverage. If even this would not break the
tie, then the selection is made at random. p then sends an In!
message to the q corresponding to the selected hypothesis, and
Null messages to processes in Q corresponding to other hypotheses
that can explain d.

Process q receives messages from processes in P corresponding to
d_i ∈ e(h). At the end of the third cycle, a C consisting of the
Essential, Firm, and In hypotheses has been generated.
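
The three cycles can be traced with a sequential simulation. The Python sketch below is an illustration only, not the CSP algorithms of [4]: it replaces true message passing with synchronous rounds, assumes a single belief threshold `theta`, takes D to be the union of the coverages (so no datum is Unexplainable), and uses hypothetical names (`generate`, `senders`, `admit`) and toy data.

```python
import random


def generate(e, b, theta, seed=0):
    """Simulate the three message cycles; return {hypothesis: class} for C.

    e[h] is the set of data h explains and b[h] is its belief value.
    """
    rng = random.Random(seed)
    data = sorted(set().union(*e.values()))
    # For each p: the q processes it hears from (those h with d in e(h)).
    senders = {d: [h for h in e if d in e[h]] for d in data}
    label = {}          # hypotheses included in C so far
    explained = set()   # data whose p has received an Explained! message

    def admit(chosen, cls):
        # q side: each h that received Essential!/Firm!/In! joins C and
        # sends Explained! to every datum in e(h) in the next cycle.
        for h in chosen:
            label[h] = cls
            explained.update(e[h])

    # Cycle 1: a p receiving exactly one Essential? replies Essential!.
    admit({hs[0] for hs in senders.values() if len(hs) == 1}, "Essential")

    # Cycle 2: an unexplained p replies Firm! if exactly one sender has a
    # belief value above theta.
    chosen = set()
    for d in data:
        if d not in explained:
            above = [h for h in senders[d] if b[h] > theta]
            if len(above) == 1:
                chosen.add(above[0])
    admit(chosen, "Firm")

    # Cycle 3: a still unexplained p selects by belief value, then by
    # explanatory coverage, then at random.
    chosen = set()
    for d in data:
        if d not in explained:
            best = max((b[h], len(e[h])) for h in senders[d])
            ties = [h for h in senders[d] if (b[h], len(e[h])) == best]
            chosen.add(rng.choice(ties))
    admit(chosen, "In")
    return label


# Hypothetical toy instance.
e = {"h1": {"d1", "d2"}, "h2": {"d2", "d3"}, "h3": {"d3", "d4"}, "h4": {"d4"}}
b = {"h1": 0.6, "h2": 0.9, "h3": 0.4, "h4": 0.3}
result = generate(e, b, theta=0.8)
print(result)  # {'h1': 'Essential', 'h2': 'Firm', 'h3': 'In'}
```

In this instance d1 has a single explainer (h1 is Essential), d3 has exactly one explainer above the threshold (h2 is Firm), and d4 is settled in the third cycle by belief value (h3 is In).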


Distributed Hypothesis Testing 


If C_In ⊆ C is empty, then the generated C is the best
explanation since the Essential hypotheses cannot be removed from
C without reducing e(C), and the Firm hypotheses cannot be
removed without reducing b(C). If, however, C_In is nonempty,
then C should be tested for parsimony by testing the In
hypotheses in it for explanatory superfluousness. Again, it
suffices to present an algorithm for distributed testing of C
from the perspectives of a datum d ∈ D represented by a process
p ∈ P and a hypothesis h ∈ C_In represented by a process q ∈ Q.
The testing of C begins when process q sends a Superfluous?
message to processes in P corresponding to data d_i ∈ e(h).
Process p receives messages from the processes in Q corresponding
to the In hypotheses that can explain d. Now from the perspective
of d, one of three things can happen:

(i) p receives no messages. Then p does nothing;

(ii) p receives exactly one Superfluous? message. Then if the d
is explained by some Essential or Firm hypothesis, p sends a
Superfluous! message to the q_j from whom it had received the
message, else it sends back a Null message;

(iii) p receives more than one Superfluous? message. Again, if
the d is explained by some Essential or Firm hypothesis, then p
sends back a Superfluous! message to all processes in Q from whom
it received messages. If, however, d is not thus explained, then
p selects the hypothesis with the highest belief value among the
In hypotheses that can explain d. Again, if two (or more)
hypotheses can explain d equally well, then p breaks the tie
between them by selecting the hypothesis with the largest
explanatory coverage, and if this would not break the tie, then
the selection is made at random. p then sends a Null message to
the process in Q corresponding to the selected hypothesis and a
Superfluous! message to processes in Q corresponding to other In
hypotheses that can explain d.

Process q receives messages from processes in P corresponding to
d_i ∈ e(h). Now from the perspective of h, one of two things can
happen:

(i) q receives only Null messages. Then q does nothing and h
remains In;

(ii) q receives at least one Superfluous! message. Then if the
number of such messages received equals |e(h)|, h is
explanatorily superfluous and is no longer In.
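
The testing round can be sketched in the same sequential style. This Python fragment is an illustration under assumptions: `check_parsimony` is a hypothetical name, `label` maps each hypothesis in C to its class, and the exchange of Superfluous? and Superfluous! messages is reduced to counting replies per In hypothesis.

```python
import random


def check_parsimony(label, e, b, seed=0):
    """Return C with the In hypotheses found explanatorily superfluous removed."""
    rng = random.Random(seed)
    in_hyps = [h for h, cls in label.items() if cls == "In"]
    # Data already explained by some Essential or Firm hypothesis.
    strong = set()
    for h, cls in label.items():
        if cls != "In":
            strong |= e[h]
    replies = {h: 0 for h in in_hyps}   # Superfluous! messages received by each q
    for d in sorted(set().union(*(e[h] for h in in_hyps))):
        senders = [h for h in in_hyps if d in e[h]]   # Superfluous? received by p
        if d in strong:
            keep = None   # every sender is told Superfluous!
        else:
            best = max((b[h], len(e[h])) for h in senders)
            keep = rng.choice([h for h in senders if (b[h], len(e[h])) == best])
        for h in senders:
            if h != keep:   # the kept hypothesis receives a Null message instead
                replies[h] += 1
    return {h: cls for h, cls in label.items()
            if cls != "In" or replies[h] < len(e[h])}


# Hypothetical toy instance: h5 explains only d2, which h1 already explains.
e = {"h1": {"d1", "d2"}, "h5": {"d2"}, "h6": {"d2", "d3"}}
b = {"h1": 0.9, "h5": 0.5, "h6": 0.4}
label = {"h1": "Essential", "h5": "In", "h6": "In"}
pruned = check_parsimony(label, e, b)
print(pruned)  # {'h1': 'Essential', 'h6': 'In'}
```

h5 receives |e(h5)| = 1 Superfluous! messages and is dropped, while h6 is still needed for d3 and remains In.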

We note that under certain conditions h could be explanatorily
superfluous even if the number of Superfluous! messages received
by q in the last step is less than |e(h)|. This could happen,
for instance, if the explanatory coverages of two or more
hypotheses completely overlap and their belief values are exactly
equal. In such a case, each p process corresponding to some datum
in the overlapping explanatory coverages would randomly select
one of the candidate hypotheses for inclusion in C. Moreover,
since each such p process makes its selection locally, the
randomness in hypothesis selection may also result in the
retention in C of more than one of the candidate hypotheses even
though all but one of them are explanatorily superfluous. This
problem can be solved if during testing of C for parsimony, when
faced with random hypothesis selection, each p process alerts a
special process, say R, of the situation by sending a message
containing a list of the candidate hypotheses. Process R receives
the messages from each such p process and makes a random choice
between the hypotheses. It then retains the selected hypothesis
in C and removes the other hypotheses from it.


A similar problem may occur if the belief values of more than two
In hypotheses are exactly equal and their explanatory coverages
overlap cyclically. Let us consider as an example a situation in
which h_1, h_2, h_3 ∈ C_In, b(h_1) = b(h_2) = b(h_3),
e(h_1) = {d_1, d_2}, e(h_2) = {d_2, d_3} and e(h_3) = {d_3, d_1},
where d_1, d_2, d_3 ∈ D. Since the belief values and the
cardinality of explanatory coverages of h_1, h_2, h_3 are equal,
the p processes corresponding to d_1, d_2, d_3 make their choices
randomly. Again, the p processes inform process R of the
situation, and again R makes a random choice between the
hypotheses, retaining in C, for this example, any two of the
hypotheses and removing the third. At the end of processing by R,
a C that “best” explains D has been synthesized.
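
The cyclic example is small enough to enumerate. The Python sketch below is an illustration only: it lists every combination of independent local choices by the three p processes and compares the number of hypotheses retained with the size of the smallest subset that explains all of D.

```python
from itertools import combinations, product

# The cyclic example from the text.
e = {"h1": {"d1", "d2"}, "h2": {"d2", "d3"}, "h3": {"d3", "d1"}}
D = ["d1", "d2", "d3"]
candidates = {d: sorted(h for h in e if d in e[h]) for d in D}

# Every combination of local choices, one per p process; a hypothesis is
# retained if at least one datum selects it.
retained_sizes = sorted(
    {len(set(choice)) for choice in product(*(candidates[d] for d in D))}
)

# Sizes of all subsets of {h1, h2, h3} that explain every datum in D.
cover_sizes = [
    r
    for r in range(1, 4)
    for c in combinations(sorted(e), r)
    if set().union(*(e[h] for h in c)) == set(D)
]
smallest = min(cover_sizes)
print(retained_sizes, smallest)  # [2, 3] 2
```

Purely local choices can retain all three hypotheses, although any two of them suffice, which is why the choice is delegated to the single process R.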


Discussion of the Concurrent Model 
Analysis of Algorithms 


The worst case time complexity for distributed generation of C
is given by

\[ T_{\text{hypothesis generation}} = O(m + n) \]

where m is the cardinality of H and n is the cardinality of D.
Similarly, the worst case complexity for distributed testing of C
for parsimony is given by

\[ T_{\text{hypothesis testing}} = O(m_{In} \times n) \]

where m_In is the number of In hypotheses in C. We may compare
this to the complexity of a non-optimal serial algorithm that has
been used to build several major abductive knowledge-based
systems [7]. The worst case time complexity for the sequential
generation of C is given by [1]

\[ T_{\text{hypothesis generation}} = O(n \times (m + n \times \log(n))) \]

and the worst case complexity of testing C for parsimony is
given by

\[ T_{\text{parsimony testing}} = O(m \times n \times \log(n)) \]

Moreover, in the sequential algorithm, the Essential hypotheses
are not found during the generation of C. Instead, a C is
assembled without regard for the Essential hypotheses and then
tested for essentialness of the hypotheses. The worst case time
complexity for testing of C for essentialness of hypotheses is
given by [1]

\[ T_{\text{essentialness testing}} = O(m \times n \times (m + n \times \log(n))) \]


We note that the constants in the time complexities for
sequential and distributed synthesis of C are comparable since in
both cases they arise from linear search. It thus appears that
the concurrent algorithms provide a linear speed up of processing
over the sequential algorithms. However, there are several
reasons for caution about this claim. Firstly, the serial
algorithms are non-optimal. Secondly, the time complexities are
for the worst case, and not for the “average” case since the
“average” case is domain dependent. Thirdly, the complexities are
valid only under the assumptions of linearity of the synthesis
task. Finally, we have not accounted for the costs of
communication between the P and the Q processes in calculating
the complexities of the concurrent algorithms. Despite these
caveats, it is clear that distributed synthesis provides an
attractive model for the synthesis of composite explanatory
hypotheses.


Architectural Implications 


There are several interesting and somewhat unusual aspects to
the computational model for concurrent synthesis of C from the
viewpoint of realizing it on a distributed memory, message
passing, parallel computer architecture. Firstly, the parallelism
between the semi-autonomous P and Q processes is fine-grained.
Secondly, at any given time during the processing, the P and Q
processes are either idle or executing the same instruction on
different data. Thirdly, the process of synthesizing C is
communication intensive. For instance, in the first cycle during
the generation of C, each process q_j ∈ Q communicates with every
process p_i ∈ P. Fourthly, for real world problems, the number of
P and Q processes is potentially very large. Even for small
knowledge-based systems that use abductive inference to perform
medical diagnosis in limited domains, for instance, the number of
P and Q processes is typically in the hundreds. A full scale
medical diagnostic system may well require thousands, or even
tens of thousands of such processes.

It appears that among the existing machines, the Connection
Machine [5] may be the most suitable architecture for realizing
this model for distributed synthesis of C. The Connection Machine
is a distributed memory, message passing, parallel computer,
which is precisely the architecture required for realizing the
model. It is a single-instruction, multiple data stream machine,
which suits the control of processing in the model. It is based
on the hypercube architecture, which helps to keep the
communication costs within acceptable limits. Finally, it
supports the type of massive, fine-grained parallelism between a
large number of small, semi-autonomous processes that is present
in this model for synthesizing C.


Interactions between Hypotheses 


So far we have discussed only the linear version of the general
synthesis task, where we assumed that the hypotheses h_j ∈ H are
non-interacting. In fact, several distinct types of interaction
between two hypotheses h_1, h_2 ∈ H have been identified [7]:

Associativity: The inclusion of h_1 in C suggests the inclusion
of h_2. Such an interaction may arise if the generator has
knowledge of, say, a statistical association between h_1 and h_2.
Additivity: h_1 and h_2 cooperate additively where their
explanatory capabilities overlap. This may happen if h_1 and h_2
can separately explain some d ∈ D only partially, but
collectively can explain it fully.
Incompatibility: h_1 and h_2 are mutually incompatible, i.e., if
one of them is included in C then the other should not be
included.
Cancellation: h_1 and h_2 cancel the explanatory capabilities of
each other in relation to some d ∈ D. For example, h_1 may imply
that some data value will increase, while h_2 may imply that the
value will decrease, thus canceling each other's explanatory
capability with respect to that datum.

We note that the general synthesis task is nonlinear in the
presence of incompatibility interactions and nonmonotonic in the
presence of cancellation interactions. The concurrent model for
distributed synthesis of composite explanatory hypotheses
presented in this paper can be extended to accommodate the above
interactions by allowing for message passing between the Q
processes. This enables the hypotheses' processes to negotiate
among themselves and resolve the conflicts that arise due to the
presence of these interactions.
Abstract


Block algorithms for matrix factoriza- 
tions have gained much attention in recent 
years as an approach that takes advantage of 
computer architectures with a relatively 
small amount of very high speed cache 
memory. The advantage of this approach is 
efficient re-use of data while it is in cache. 
Block algorithms are also useful for general 
purpose single or multi-headed vector 
machines with interleaved memory. How- 
ever, their performance advantage over tradi- 
tional matrix-vector algorithms is very 
problem-dependent. An example of such 
behavior is the LU factorization of a general, 
dense matrix. This paper presents a block 
LU factorization algorithm that is appropri- 
ate for the CONVEX C Series machines. 
Performance comparisons are provided
against the matrix-vector Crout algorithm.


1. Introduction 


Gaussian elimination can be viewed as
the reduction of a matrix, A, to upper
triangular form using multiplication by
transformation and row permutation
matrices as follows:

\[ M_{n-1} P_{n-1} M_{n-2} P_{n-2} \cdots M_1 P_1 A = L^{-1} A = U \]

where the P_k are row permutation matrices
and the M_k are defined by

\[ y_k = [\, 0 \;\; 0 \;\; \cdots \;\; m_k \,]^t, \qquad M_k = I + y_k e_k^t \]

where there are k zeros in y_k, e_k is the kth
column of the identity matrix and m_k is the
vector of negative multipliers. A simple
manner of applying a block p x p reduction is
to develop an algorithm for the update

\[ A_k := ( M_{k+p-1} P_{k+p-1} \cdots M_{k+1} P_{k+1} M_k P_k ) A_k \]

or an update based on the product of p Gauss
transformation and row permutation
matrices. This approach is similar to that in
Dongarra [4]. The upper left-hand corner of
A_k will be a_{k+p,k+p} and the lower right-hand
corner will be a_{nn}. The notation x^{(j)} will be
used to denote a section of a vector x beginning
at element j. Define

\[ m_k = [\, m_{k+1,k} \;\; m_{k+2,k} \;\; \cdots \;\; m_{n,k} \,]^t \]
\[ u_k^t = [\, a_{k,k+1} \;\; a_{k,k+2} \;\; \cdots \;\; a_{k,n} \,] \]
\[ a_k = [\, a_{k+1,k} \;\; a_{k+2,k} \;\; \cdots \;\; a_{n,k} \,]^t \]

Following is the rank-p update formula for
p = 2. Operations relating to pivoting are
deleted to conserve space.


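    k := 1
    while k < n-1
        s_k := 1/a_{k,k}
        m_k := s_k a_k
        a_{k+1} := a_{k+1} - a_{k,k+1} m_k
        u_k^t := [ a_{k,k+2}  a_{k,k+3}  ...  a_{k,n} ]
        u_{k+1}^t := [ a_{k+1,k+2}  a_{k+1,k+3}  ...  a_{k+1,n} ] - m_{k+1,k} u_k^t
        s_{k+1} := 1/a_{k+1,k+1}
        m_{k+1} := s_{k+1} a_{k+1}^{(2)}
        A_k := A_k - m_k^{(2)} u_k^t - m_{k+1} u_{k+1}^t
        k := k + 2
    end k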
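
The paired update can be tried out on a small matrix. The following Python sketch is an illustration under assumptions, not the CONVEX implementation: it uses 0-based indexing, omits pivoting (as the listing above does), works in place on a list of rows, and the names `lu_rank2` and `lu_product` are hypothetical.

```python
def lu_rank2(A):
    """In-place LU without pivoting, eliminating two columns per pass.

    On return A holds U on and above the diagonal and the multipliers of L
    (unit diagonal implied) below it.
    """
    n = len(A)
    k = 0
    while k < n - 1:
        s = 1.0 / A[k][k]
        for i in range(k + 1, n):          # m_k := s_k a_k
            A[i][k] *= s
        for i in range(k + 1, n):          # update column k+1 with m_k
            A[i][k + 1] -= A[i][k] * A[k][k + 1]
        for j in range(k + 2, n):          # u_{k+1} := row k+1 - m_{k+1,k} u_k
            A[k + 1][j] -= A[k + 1][k] * A[k][j]
        s = 1.0 / A[k + 1][k + 1]
        for i in range(k + 2, n):          # m_{k+1} := s_{k+1} a_{k+1}
            A[i][k + 1] *= s
        for i in range(k + 2, n):          # rank-2 update of the trailing block
            m0, m1 = A[i][k], A[i][k + 1]
            for j in range(k + 2, n):
                A[i][j] -= m0 * A[k][j] + m1 * A[k + 1][j]
        k += 2
    return A


def lu_product(LU):
    """Multiply the packed factors back together (for checking only)."""
    n = len(LU)
    return [[sum((LU[i][c] if c < i else 1.0) * LU[c][j]
                 for c in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 4.0], [2.0, 4.0, 10.0]]
LU = lu_rank2([row[:] for row in A])
print(LU)  # [[2.0, 1.0, 1.0], [2.0, 1.0, 2.0], [1.0, 3.0, 3.0]]
```

Each pass through the loop touches every element of the trailing submatrix once while applying two eliminations, which is the memory-traffic saving the rank-2 formulation aims for.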


Any remaining portion of the computation
that cannot be included in the paired
updating scheme is finished off with a scalar
version of the rank-1 algorithm. The algorithm
for p = 3 is

    k := 1
    while k < n-1
        s_k := 1/a_{k,k}
        m_k := s_k a_k
        a_{k+1} := a_{k+1} - a_{k,k+1} m_k
        u_k^t := [ a_{k,k+2}  a_{k,k+3}  ...  a_{k,n} ]
        u_{k+1}^t := [ a_{k+1,k+2}  a_{k+1,k+3}  ...  a_{k+1,n} ]
        u_{k+1}^t := u_{k+1}^t - m_{k+1,k} u_k^t
        s_{k+1} := 1/a_{k+1,k+1}
        m_{k+1} := s_{k+1} a_{k+1}^{(2)}
        a_{k+2} := a_{k+2} - a_{k,k+2} m_k^{(2)} - a_{k+1,k+2} m_{k+1}
        u_{k+2}^t := [ a_{k+2,k+3}  a_{k+2,k+4}  ...  a_{k+2,n} ]
        u_{k+2}^t := u_{k+2}^t - m_{k+2,k} u_k^{t(2)} - m_{k+2,k+1} u_{k+1}^{t(2)}
        s_{k+2} := 1/a_{k+2,k+2}
        m_{k+2} := s_{k+2} a_{k+2}^{(2)}
        A_k := A_k - m_k^{(3)} u_k^{t(2)} - m_{k+1}^{(2)} u_{k+1}^{t(2)} - m_{k+2} u_{k+2}^t
        k := k + 3
    end k


As before, scalar cleanup with p = 1 is
used. The rank-2 algorithm requires two
multiply-adds per two memory references in
the update, which runs at full speed on a
machine with one path to memory. The
rank-3 approach runs even faster because the
columns of the submatrices at each stage are
loaded and stored fewer times. The rank-3
updates can also be coded to run the column
memory references one column ahead of the
current update. This eliminates the
load/store pipeline latency upon branching to
the top of the loop. This approach saves a
significant amount of memory start-up time in
the updates.


These algorithms are different from
traditional block LU algorithms in that the
initial “block” reduction is not performed on
a 2 x 2 or 3 x 3 block. Instead, the
operations are performed in a manner similar to
the standard algorithm. This method retains
the same (longer) vector lengths as the
standard algorithm. It is also very easy to
integrate the block column and row
operations with the pivoting memory traffic. In
fact, the block p x p LU factorization is a side
effect of the current algorithm! Based on the
work in Armstrong [1], the best performance
to date from such a method has been to use
rank-3 updates and switch to rank-2 updates
after the reduction reaches a certain point.
The current crossover is when the submatrix
order reaches 128 or less. If the entire
factorization is coded in assembly language, then
the first 128-element (or shorter) section of the
last multiplier array can be retained in a
vector register for use in the first block section
of the rank-p update. This is currently the
approach used on the C120. Because of the
increased memory subsystem efficiency on the
C210, this method has only a negligible
advantage over the rank-3 algorithm. The
performance in Megaflops (Mflops) is
provided in the tables below. Dongarra's matgen
routine (from the LINPACK benchmarks)
was used to generate the system of equations
and the flop count used was (2/3)n^3 + 2n^2.
The M-V column represents the matrix-vector
Crout algorithm implemented in FORTRAN
with calls to assembly language
IDAMAX and DGEMV.
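
The flop count quoted above turns a measured run time into an Mflops figure. The short Python sketch below illustrates the arithmetic only; the function names `linpack_flops` and `mflops` are hypothetical and the timing in the example is not a measurement from the paper.

```python
def linpack_flops(n):
    """Operation count (2/3)n^3 + 2n^2 used for an n x n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, seconds):
    """Millions of floating-point operations per second for one solve."""
    return linpack_flops(n) / seconds / 1.0e6


flops_300 = linpack_flops(300)
print(flops_300)  # 18180000.0
```

For example, a 300 x 300 system counts as 18.18 million operations, so a hypothetical run time of one second would correspond to 18.18 Mflops.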


C120 Rank-p update performance, 64-bit Mflops
n | M-V | Rank-3 | Rank-(3,2)


C210 Rank-3 update performance, 64-bit Mflops
n | M-V | Rank-3

2. General Block LU Factorization Algorithm


An excellent introduction to the
rationale and development of block
algorithms is provided in references [2,3,5,6].
This section presents an algorithm which
generalizes the rank-p approach. The diagram
below illustrates the fundamental idea behind
the algorithm. The p x p block submatrix B
contains, in effect, an LU factorization of a
block of the submatrix at stage k. However,
the computation of multipliers and row
updates is extended to the extremes of the
submatrix, which produces sub-blocks B_1 and
B_2. As before, the main idea of the
algorithm is to compute B_1 and B_2. The block
LU factorization in B is the result of this
computation. This is in contrast to traditional
block algorithms that compute B first, and
use this result to generate B_1 and B_2. The
remainder of the submatrix, A_k, is updated
according to a formula of the form

\[ A_k := A_k + \alpha B_1 B_2 \]

where \alpha = \pm 1 depending on whether
positive or negative multipliers are used. The
algorithm is implemented in two parts. The
first part computes the submatrices B_1 and
B_2 for a given value of p. The second part
simply consists of a call to GEMM from the
level III BLAS.


B_1 and B_2 are generated from a
matrix-vector algorithm that applies rank-1
updates one column and row at a time. The
advantage of this approach is fewer memory
references than in a standard SAXPY
algorithm for applying the updates. The
algorithm requires an initial pivot operation to
“prime” it. The main loop consists of
computing a vector of multipliers for the j-th
stage. This vector, along with the previous
multiplier vectors from the beginning of the
block, is used to update the next column.
This update can be formulated as a
matrix-vector product since the updates are not
all done at once. Instead, they are applied “on
the fly”. Next, the pivot operation for the
following stage is performed. A transposed
matrix-vector product is then used to apply
all the previous row updates to the remainder
of row j+1 of U.


The above summary can be mathematically
illustrated by partitioning the
sub-blocks of B_1 and B_2 at stage j as follows

\[ B_1 = \begin{bmatrix} B_1^{(j)} & b_1 \end{bmatrix}, \qquad
   B_2 = \begin{bmatrix} u_{j+1} & B_2^{(j)} \\ & b_2^t \end{bmatrix} \]

Also, define l_{j+1}^t as the first row of the
submatrix B_1^{(j)}. At this point, everything has
been computed except b_1 and b_2. These
vectors are generated by the following equations:

\[ b_1 := b_1 - B_1^{(j)} u_{j+1} \]

\[ b_2^t := b_2^t - l_{j+1}^t B_2^{(j)} \]


Following is the algorithm for computing
B_1 and B_2, which is a black box in the
main algorithm. The current stage of the
reduction is denoted by k and the block size
by p. The main loop is executed for
j = k, k+1, ..., k+p-2. Inside this loop, the
submatrix B_1^{(j)} has its upper left-hand corner
at a_{j+1,k} and is an (n-j) x (j-k+1) matrix.
B_2^{(j)} has its upper left-hand corner at a_{k,j+2}
and is a (j-k+1) x (n-j-1) matrix. The
vectors u_{j+1} and l_{j+1} are of length j-k+1.
u_{j+1} has its initial element at a_{k,j+1}, and l_{j+1}
starts at a_{j+1,k}. b_1 has its initial element at
a_{j+1,j+1}, and b_2 starts at a_{j+1,j+2}. In the
following algorithm, the phrase “pivot for stage
j” indicates the operation of finding the pivot
row for stage j and performing the row
interchange. A test for zero pivots is assumed.
Also, a_j has the same definition as in previous
algorithms.

    pivot for stage k (primer for the algorithm)
    For j = k, k+p-2
        s := 1/a_{j,j}
        a_j := s a_j
        b_1 := b_1 - B_1^{(j)} u_{j+1}
        pivot for stage j+1
        b_2^t := b_2^t - l_{j+1}^t B_2^{(j)}
    end j
    s := 1/a_{k+p-1,k+p-1}
    a_{k+p-1} := s a_{k+p-1}


An interesting feature of the above
algorithm is that, if the block size is set to n,
the result is the same basic approach as the
matrix-vector version of Crout factorization.
Thus, one can derive such factorization
algorithms from the viewpoint of compact
elimination or from applying rank-1 updates
“on the fly.”


The complete block LU factorization
can now be presented for a given block size p.
Let m = n-k-p+1. Then the submatrix A_k
that is updated has its upper left-hand corner
at a_{k+p,k+p} and is an m x m matrix. B_1 has
its upper left-hand corner at a_{k+p,k} and is an
m x p matrix. B_2 has its upper left-hand
corner at a_{k,k+p} and is a p x m matrix.
Assume that n/p is computed as in Fortran.

    l := n/p
    k := 1
    For i = 1, l
        compute B_1 and B_2
        A_k := A_k - B_1 B_2
        k := k + p
    end i

Clean up leftover submatrix beginning at a_{kk}
using the rank-2 update algorithm described
in section 1.
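
The two-part structure (compute B_1 and B_2 with matrix-vector operations, then one matrix-matrix update) can be sketched in plain Python. This is an illustration under assumptions rather than the FORTRAN/BLAS code: it uses 0-based indexing, omits pivoting, replaces the GEMM call with explicit loops, uses a plain rank-1 elimination for the cleanup, and the names `block_lu` and `lu_product` are hypothetical.

```python
def block_lu(A, p):
    """In-place block LU without pivoting; L (unit diagonal implied) and U share A."""
    n = len(A)
    k = 0
    for _ in range(n // p):                  # l := n/p with integer division
        # Part 1: compute B1 and B2 with matrix-vector products.
        for j in range(k, k + p - 1):
            s = 1.0 / A[j][j]
            for i in range(j + 1, n):        # multipliers for stage j
                A[i][j] *= s
            for i in range(j + 1, n):        # b1 := b1 - B1 u_{j+1}
                A[i][j + 1] -= sum(A[i][c] * A[c][j + 1] for c in range(k, j + 1))
            for col in range(j + 2, n):      # b2 := b2 - l_{j+1}^t B2
                A[j + 1][col] -= sum(A[j + 1][c] * A[c][col] for c in range(k, j + 1))
        last = k + p - 1
        s = 1.0 / A[last][last]
        for i in range(last + 1, n):         # multipliers for the last block column
            A[i][last] *= s
        # Part 2: A_k := A_k - B1 B2 (the role played by GEMM).
        for i in range(k + p, n):
            for col in range(k + p, n):
                A[i][col] -= sum(A[i][c] * A[c][col] for c in range(k, k + p))
        k += p
    # Cleanup of the leftover submatrix, here by plain rank-1 elimination.
    for j in range(k, n - 1):
        s = 1.0 / A[j][j]
        for i in range(j + 1, n):
            A[i][j] *= s
            for col in range(j + 1, n):
                A[i][col] -= A[i][j] * A[j][col]
    return A


def lu_product(LU):
    """Multiply the packed factors back together (for checking only)."""
    n = len(LU)
    return [[sum((LU[i][c] if c < i else 1.0) * LU[c][j]
                 for c in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 4.0], [2.0, 4.0, 10.0]]
LU = block_lu([row[:] for row in A], p=2)
print(LU)  # [[2.0, 1.0, 1.0], [2.0, 1.0, 2.0], [1.0, 3.0, 3.0]]
```

With p equal to n the second part does no work and the first part alone performs a Crout-style factorization.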


3. Performance Notes 


Experiments conducted on the CONVEX
C120 have indicated that in order to
approach the performance of the specific
rank-2 and rank-3 algorithms in section 1,
large block sizes are required. The
asymptotic performance seems to be reached with
p = 200. This indicates the use of a
multi-algorithm approach. For n ≤ 200, the rank-3
algorithm can be used. For n > 200, the
general block algorithm is used with
rank-(2,1) cleanup in FORTRAN.


The general block algorithm with
assembly language IDAMAX, DGEMV, and
DGEMM executes at 14.98 Mflops for a 300 x
300 system. For n = 1000, it executes at
15.75 Mflops. For n > 1000, the general
block algorithm asymptotically performs 1.2
Mflops faster than matrix-vector Crout. On
the C210, the asymptotic performance
advantage is 2.3 Mflops.


An advantage of the general block
algorithm is that it can be easily implemented in
FORTRAN with calls to the level II and III
BLAS. A disadvantage relative to the specific
rank-p update algorithms with small values
of p is the inability to merge the computation
of B_1 and B_2 with the pivoting memory
traffic. This is the reason for the performance
advantage of the rank-p approach.


A later paper will describe the
implementation and parallel performance on a
dual-headed CONVEX C220.
Abstract — Polynomial systems consist of n polynomial functions
in n variables, with real or complex coefficients. Finding zeros of 
such systems is challenging because there may be a large number 
of solutions, and Newton-type methods can rarely be guaranteed to 
find the complete set of solutions. There are homotopy algorithms for 
polynomial systems of equations that are globally convergent from an 
arbitrary starting point with probability one, are guaranteed to find 
all the solutions, and are robust, accurate, and reasonably efficient. 
There is inherent parallelism at several levels in these algorithms. 
Several parallel homotopy algorithms with different granularities are 
studied on several different parallel machines, using actual industrial 
problems from chemical engineering and solid modeling. 


1. Introduction. 


Solving nonlinear systems of equations is a central problem in 
numerical analysis, with enormous significance for science and en- 
gineering. A very special case, namely small polynomial systems 
of equations, occurs frequently enough in solid modeling, robotics, 
computer vision, chemical equilibrium computations, chemical pro- 
cess design, mechanical engineering, and other areas to justify spe- 
cial algorithms. To put polynomial systems in perspective and for 
the purpose of this discussion, there are three classes of nonlinear 
systems of equations: (1) large systems with sparse Jacobian ma- 
trices, (2) small transcendental (nonpolynomial) systems with dense 
Jacobian matrices, and (3) small polynomial systems with dense Ja- 
cobian matrices. Sparsity for small problems is not significant, and 
large systems with dense Jacobian matrices are intractable, so these 
two classes are not counted. 

Large sparse nonlinear systems of equations, such as equilibrium 
equations in structural mechanics, have two characterizing aspects: 
highly nonlinear and recursive scalar computations, and large ma- 
trix, vector operations. There is a great amount of parallelism in 
both aspects, but the nature of the parallelism is very different (or 
so it seems). Small dense transcendental systems of equations pose 
a major challenge, since they involve recursive, scalar intensive com- 
putation with a small amount of linear algebra. Finally, polynomial 
systems are unique in that they have many solutions, of which several 
may be physically meaningful, and there exist homotopy algorithms 
guaranteed to find all these meaningful solutions. The very special 
nature of polynomial systems and the power of homotopy algorithms 
are often not fully appreciated, perhaps because globally convergent 
probability-one homotopy methods are not widely known. 

These globally convergent homotopy algorithms for polynomial 
systems have inherent parallelism at several levels. The purpose of 
the present paper is to study different granularities of parallel homotopy
algorithms for polynomial systems, corresponding to different
decomposition and communication strategies.

Much work has been done on solving linear systems of equa- 
tions on parallel computers, mostly on vector machines [4], [5], [7], 
[8], [10]-[12], [14]-[16], [18], [23], [25]. Some work has been done on
nonlinear equations and Newton’s method [28], [31], [36], [37], and 
on finding the roots of a single polynomial equation [9], [27]. Par- 
allel algorithms for polynomial systems were proposed by Morgan 
and Watson [22], but have not been studied much, nor have parallel 
homotopy algorithms for nonlinear systems of equations. 

Section 2 summarizes the mathematics behind the homotopy 
algorithm, Section 3 discusses the special case of polynomial systems 
in some detail, and computational results on several parallel machines 
are presented and discussed in Section 4. 


2. Homotopy algorithm. 


Let $E^n$ denote n-dimensional real Euclidean space. The fundamental
mathematical result behind the homotopy algorithm for
solving the nonlinear system of equations

$F(x) = 0$, (1)

where $F: E^n \to E^n$ is a $C^2$ (twice continuously differentiable) function,
is as follows:

Proposition 1 ([6], [33]). Let $\rho: E^m \times [0,1) \times E^n \to E^n$ be a $C^2$
map and define $\rho_a(\lambda, x) = \rho(a, \lambda, x)$. Suppose that
1. the $n \times (m+n+1)$ Jacobian matrix $D\rho$ has full rank on $\rho^{-1}(0)$,
the set of zeros of $\rho$;
2. $\rho_a(0, x) = 0$ has a unique solution $W \in E^n$ (depending on the
homotopy parameter vector $a \in E^m$);
3. $\rho_a(1, x) = F(x)$;
4. the set of zeros of $\rho_a(\lambda, x)$ is bounded.
Then for almost all $a \in E^m$ there is a zero curve $\gamma$ of $\rho_a(\lambda, x)$ along
which the Jacobian matrix $D\rho_a(\lambda, x)$ has full rank, emanating from
$(0, W)$ and reaching a zero $\bar{x}$ of $F(x)$ at $\lambda = 1$. Furthermore, $\gamma$ has
finite arc length if $DF(\bar{x})$ is nonsingular.


The general idea of the algorithm is to follow the zero curve $\gamma$ of
$\rho_a$ from $(0, W)$ until a zero $\bar{x}$ of $F(x)$ is reached at $\lambda = 1$. Although
the homotopy algorithm for solving the nonlinear system of equations
is conceptually simple, it is nontrivial to develop a viable numerical
algorithm for tracking the curve. A typical form for the homotopy
map is

$\rho_W(\lambda, x) = \lambda F(x) + (1 - \lambda)(x - W) = 0$, (2)


which has the same form as a standard continuation or embedding
mapping. However, two crucial differences exist. In standard continuation
methods, the embedding parameter $\lambda$ increases monotonically
from 0 to 1 as the trivial problem $x - W = 0$ is continuously deformed
to the problem $F(x) = 0$. Homotopy algorithms permit $\lambda$ to both
increase and decrease along $\gamma$ with no adverse effect. The second difference
is that in homotopy algorithms there are no “singular points”
which afflict standard continuation methods. This is guaranteed by
the way in which the zero curve $\gamma$ of $\rho_a$ is followed and the fact that
$D\rho_a$ has full rank along $\gamma$.

The zero curve $\gamma$ of the homotopy map $\rho_a(\lambda, x)$ can be tracked by
many different techniques. The present work used HOMPACK [34],
a software package for solving systems of nonlinear equations based
on the homotopy method. HOMPACK provides three approaches for
tracking $\gamma$: 1) an ODE-based algorithm with some special refinements
for the homotopy context; 2) a predictor-corrector algorithm
whose corrector follows the flow normal to the Davidenko flow (a
“normal flow” algorithm); and 3) a version of Rheinboldt’s linear predictor,
quasi-Newton corrector algorithm (an “augmented Jacobian
matrix” method). Since the “normal flow” algorithm is the technique
used by HOMPACK to solve polynomial systems of equations, the
other two algorithms will not be discussed in this paper.

The normal flow algorithm has three phases: prediction, cor- 
rection, and step size estimation. In the prediction phase, the next 
point on the zero curve is predicted. Starting from the predicted 
point, the correction phase then iterates until a point on the zero 
curve is reached. An “optimal” step size is then estimated for the 
prediction of the next point on the curve. 

The normal flow algorithm is so called because the iterates of the
correction phase converge from the predicted point back to the zero
curve along the flow normal to the Davidenko flow. The Davidenko
flow is the family of zero curves formed from varying the parameter
vector a of the homotopy map $\rho_a$.

The zero curve $\gamma$ of $\rho_a(\lambda, x)$ is $C^1$ and can be parametrized by
the arc length s. Thus $\lambda = \lambda(s)$ and $x = x(s)$ with initial conditions
$\lambda(0) = 0$ and $x(0) = W$. When $\lambda(\bar{s}) = 1$, the corresponding $x(\bar{s}) = \bar{x}$
is a zero of $F(x)$.

Suppose that the following are available from previous calculations:
two previous points on the curve, $P(s_1) = (\lambda(s_1), x(s_1))$ and
$P(s_2) = (\lambda(s_2), x(s_2))$, their corresponding tangent vectors,
$P'(s_1) = (d\lambda/ds(s_1), dx/ds(s_1))$ and $P'(s_2) = (d\lambda/ds(s_2), dx/ds(s_2))$, and h,
an estimate of the next “optimal” step size (in arc length) to take
along $\gamma$. Then the next point on the zero curve can be estimated by

$Z^{(0)} = H(s_2 + h)$, (3)

where $H(s)$ is the Hermite cubic polynomial which interpolates $P(s)$
at $s_1$ and $s_2$. Thus, $H(s_1) = P(s_1)$, $H'(s_1) = P'(s_1)$, $H(s_2) = P(s_2)$,
and $H'(s_2) = P'(s_2)$.

Since

$\rho_a(\lambda(s), x(s)) = 0$ (4)

on the zero curve $\gamma$,

$\frac{d}{ds} [\rho_a(\lambda(s), x(s))] = D\rho_a(\lambda(s), x(s)) (d\lambda/ds, dx/ds)^t = 0$, (5)

where $(d\lambda/ds, dx/ds)$ is the tangent vector to the curve. Thus, the
tangent vector can be calculated by finding the kernel of the Jacobian
matrix $D\rho_a(\lambda(s), x(s))$, which has rank n by Proposition 1. Once the
kernel is found, the derivative $(d\lambda/ds, dx/ds)$ at a given point on the
zero curve can be uniquely determined by

$\|(d\lambda/ds, dx/ds)\|_2 = 1$ (6)

and continuity of the tangent vector.
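
The prediction step (3) can be sketched in a few lines. The function below is only an illustration of cubic Hermite extrapolation, not HOMPACK's routine; the name `hermite_predict` and the test curve are assumptions made for this example. The test curve is not parametrized by arc length, which does not matter for the interpolation property being shown.

```python
def hermite_predict(s1, P1, dP1, s2, P2, dP2, h):
    """Evaluate at s2 + h the cubic H with H(s1)=P1, H'(s1)=dP1, H(s2)=P2, H'(s2)=dP2."""
    delta = s2 - s1
    t = (s2 + h - s1) / delta          # t > 1 means extrapolation beyond s2
    h00 = 2 * t**3 - 3 * t**2 + 1      # standard cubic Hermite basis functions
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    # Apply the same scalar formula to every component of the points and tangents.
    return [h00 * p1 + h10 * delta * d1 + h01 * p2 + h11 * delta * d2
            for p1, d1, p2, d2 in zip(P1, dP1, P2, dP2)]


# Hypothetical test curve P(s) = (s^2, s^3): a cubic is reproduced exactly,
# so the prediction from s1 = 0, s2 = 1 with h = 0.5 equals P(1.5).
Z0 = hermite_predict(0.0, [0.0, 0.0], [0.0, 0.0], 1.0, [1.0, 1.0], [2.0, 3.0], 0.5)
print(Z0)
```

On a real zero curve the prediction is only approximate, which is why the correction phase follows.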


Starting at the predicted point $Z^{(0)}$, the iteration in the correction
phase is

$Z^{(k+1)} = Z^{(k)} - [D\rho_a(Z^{(k)})]^{\dagger} \rho_a(Z^{(k)})$, $k = 0, 1, \ldots$, (7)

where $[D\rho_a(Z^{(k)})]^{\dagger}$ is the Moore-Penrose pseudoinverse of $D\rho_a$. Reordering
the equation, the corrector step $\Delta Z = Z^{(k+1)} - Z^{(k)}$ is the
unique minimum norm solution of

$[D\rho_a(Z^{(k)})] \Delta Z = -\rho_a(Z^{(k)})$. (8)

Fortunately $\Delta Z$ can be calculated at the same time as the kernel of
$[D\rho_a]$ with just a little more effort. Normally for dense problems the
kernel of $[D\rho_a]$ is found by computing the QR factorization of $[D\rho_a]$,
followed by back substitution. By applying this QR factorization to
$-\rho_a$ and using back substitution again, a particular solution v to (8)
can be found. Let u be any nonzero vector in the kernel of $[D\rho_a]$.
Then the minimum norm solution is

$\Delta Z = v - \frac{v^t u}{u^t u} u$. (9)
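
For a single equation (n = 1) the Jacobian matrix is a 1 x 2 row, and formula (9) can be checked by hand. The sketch below assumes an illustrative map of the form (2) with F(x) = x^2 - 2 and W = 1; the function names are hypothetical, and the particular solution is obtained by direct division rather than by the QR factorization that a dense solver would use.

```python
def min_norm_step(J, r):
    """Minimum norm solution dZ of J . dZ = -r for a 1 x 2 Jacobian J, via (9)."""
    v = [0.0, -r / J[1]]                 # a particular solution (assumes J[1] != 0)
    u = [J[1], -J[0]]                    # nonzero kernel vector: J . u = 0
    scale = (v[0] * u[0] + v[1] * u[1]) / (u[0] ** 2 + u[1] ** 2)
    return [v[0] - scale * u[0], v[1] - scale * u[1]]


def rho(lam, x):
    """Illustrative homotopy map of form (2) with F(x) = x^2 - 2 and W = 1."""
    return lam * (x * x - 2.0) + (1.0 - lam) * (x - 1.0)


def jacobian(lam, x):
    """Partial derivatives of rho with respect to lambda and x."""
    return [(x * x - 2.0) - (x - 1.0), 2.0 * lam * x + (1.0 - lam)]


def correct(lam, x, iterations=4):
    """Corrector iteration (7): repeated minimum norm Newton steps toward the zero curve."""
    for _ in range(iterations):
        d_lam, d_x = min_norm_step(jacobian(lam, x), rho(lam, x))
        lam, x = lam + d_lam, x + d_x
    return lam, x


lam, x = correct(0.5, 1.3)   # start slightly off the curve
print(lam, x, rho(lam, x))
```

Subtracting the component of v along u is what makes the step orthogonal to the kernel, and hence of minimum norm.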

Since the QR factorizations of $[D\rho_a]$ are computationally very
expensive, the number of iterations required for convergence of (7)
should be kept small (say <= 4) by adjusting the step size. An alternative
is to use the QR factorization of $D\rho_a$ at the first predicted
point $Z^{(0)}$ for several iterations. However, this results in linear convergence,
which is not cost effective when compared to the asymptotically
quadratic convergence of (7).

Note that the kernel of $[D\rho_a]$ is needed for the tangent vector
used in the Hermite cubic interpolation at the beginning of the next
step. When the iteration converges, the final iterate $Z^{(k+1)}$ is accepted
as the next point on $\gamma$. Rather than calculating the tangent
vector at the new point $Z^{(k+1)}$ on $\gamma$, a Jacobian matrix evaluation
and a QR factorization can be saved by using the tangent vector
calculated at $Z^{(k)}$. This substitution should not seriously affect the
calculation of the next point on the curve since this tangent is used
only in the prediction of the next point, and the tangent vector at
$Z^{(k)}$ is a good estimate of the tangent vector at $Z^{(k+1)}$.


The estimation of an “optimal” step size h is an attempt to balance
the number of iterations in the correction phase of the algorithm
with the number of steps necessary to reach the “end” point on the
zero curve where $\lambda(s) = 1$. Increasing the step size decreases the
number of steps necessary to reach the “end” of the curve. However,
taking too large a step size would result in a substantial increase
in the number of iterations necessary to correct the predicted point.
The estimation and control of the step size h is discussed in detail in
[38].


3. Polynomial systems.

Section 2 described a homotopy algorithm for finding a single
solution to a general nonlinear system of equations $F(x) = 0$. Proposition 1
provided the theoretical guarantee of convergence. The rich
structure and multiple solutions of polynomial systems dictate that
the general theory in Section 2 must be sharpened. This section
develops a globally convergent (with probability one) homotopy algorithm
that finds all solutions to a polynomial system, and provides
the theoretical justification for that algorithm.

Suppose that the components of the nonlinear function $F(x)$
have the form

$F_i(x) = \sum_{k=1}^{n_i} \left[ a_{ik} \prod_{j=1}^{n} x_j^{d_{ijk}} \right]$, $i = 1, \ldots, n$. (10)

The ith component $F_i(x)$ has $n_i$ terms, the $a_{ik}$ are the (real) coefficients,
and the degrees $d_{ijk}$ are nonnegative integers. The total degree
of $F_i$ is $d_i = \max_k \sum_{j=1}^{n} d_{ijk}$. For technical reasons it is necessary to
consider $F(x)$ as a map $F: C^n \to C^n$, where $C^n$ is n-dimensional
complex Euclidean space. A system of n polynomial equations in n
unknowns, $F(x) = 0$, may have many solutions. It is possible to define
a homotopy so that all geometrically isolated roots of (10) have at
least one associated homotopy path. Generally, (10) will have roots
at infinity, which forces some of the homotopy paths to diverge to
infinity as $\lambda$ approaches 1. However, (10) can be transformed into
a new system which, under reasonable hypotheses, can be proven to
have no roots at infinity and thus bounded homotopy paths. Because
scaling can be critical to the success of the method, a general scaling
algorithm [34] is applied to scale the coefficients and variables in (10)
before anything else is done.
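
The term structure of (10) maps directly onto a small data structure. The sketch below is one possible representation, chosen for illustration only (it is not HOMPACK's storage scheme): each equation is a list of (coefficient, exponent tuple) pairs, and the example system is hypothetical.

```python
def evaluate(system, x):
    """Evaluate every F_i at the point x, term by term as in (10)."""
    values = []
    for terms in system:
        total = 0
        for coeff, exponents in terms:
            product = coeff
            for xj, dijk in zip(x, exponents):
                product *= xj ** dijk      # x_j raised to the degree d_ijk
            total += product
        values.append(total)
    return values


def total_degree(system):
    """Product d = d_1 ... d_n of the total degrees d_i = max_k sum_j d_ijk."""
    d = 1
    for terms in system:
        d *= max(sum(exponents) for _, exponents in terms)
    return d


# Hypothetical system: F_1 = x1^2 + x2^2 - 5, F_2 = x1*x2 - 2.
system = [
    [(1, (2, 0)), (1, (0, 2)), (-5, (0, 0))],
    [(1, (1, 1)), (-2, (0, 0))],
]
print(evaluate(system, (1, 2)))   # (1, 2) is a solution of this system
print(total_degree(system))
```

Because Python's arithmetic is generic, the same `evaluate` accepts complex points, which is the setting required by the theory.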

Since the homotopy map defined below is complex analytic, the
homotopy parameter $\lambda$ is monotonically increasing as a function of
arc length [21]. Define $G: C^n \to C^n$ by $G_j(x) = b_j x_j^{d_j} - a_j$,
$j = 1, \ldots, n$, where $a_j$ and $b_j$ are nonzero complex numbers and $d_j$ is the
(total) degree of $F_j(x)$, for $j = 1, \ldots, n$. Define the homotopy map

$\rho_c(\lambda, x) = (1 - \lambda) G(x) + \lambda F(x)$, (11)

where $c = (a, b)$, $a = (a_1, \ldots, a_n) \in C^n$ and $b = (b_1, \ldots, b_n) \in C^n$.
Let $d = d_1 \cdots d_n$ be the total degree of the system. The fundamental
homotopy result, proved and discussed at length in [19]-[21], is:


Theorem. For almost all choices of a and b in $C^n$, $\rho_c^{-1}(0)$ consists
of d smooth paths emanating from $\{0\} \times C^n$, which either diverge to
infinity as $\lambda$ approaches 1 or converge to solutions to $F(x) = 0$ as $\lambda$
approaches 1. Each geometrically isolated solution of $F(x) = 0$ has
a path converging to it.
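
The theorem can be illustrated in the simplest case n = 1, where the d start points are the d roots of $b x^d = a$. The sketch below is a deliberately simplified tracker and not the normal flow algorithm: because $\lambda$ is monotone along each path, it just steps $\lambda$ from 0 to 1 and applies Newton's method in x at each step. The function names, the step count, and the constants a = 1 + i, b = i are assumptions chosen for this example.

```python
import cmath


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs run from the highest degree down to the constant."""
    result = 0j
    for c in coeffs:
        result = result * x + c
    return result


def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial."""
    d = len(coeffs) - 1
    return [c * (d - k) for k, c in enumerate(coeffs[:-1])]


def track_all_paths(coeffs, a, b, steps=1000, newton_iters=5):
    """Follow the d zero curves of (1 - lam) * (b x^d - a) + lam * F(x) from lam = 0 to 1."""
    d = len(coeffs) - 1
    dcoeffs = poly_deriv(coeffs)
    base = (a / b) ** (1.0 / d)
    starts = [base * cmath.exp(2j * cmath.pi * k / d) for k in range(d)]  # roots of G
    roots = []
    for x in starts:
        for step in range(1, steps + 1):
            lam = step / steps
            for _ in range(newton_iters):   # correct x back onto the zero curve
                value = (1 - lam) * (b * x**d - a) + lam * poly_eval(coeffs, x)
                slope = (1 - lam) * b * d * x**(d - 1) + lam * poly_eval(dcoeffs, x)
                x = x - value / slope
        roots.append(x)
    return roots


# F(x) = x^2 - 3x + 2 has total degree d = 2, so two paths are tracked.
for root in track_all_paths([1, -3, 2], a=1 + 1j, b=1j):
    print(root)
```

Each path is independent of the others, which is the source of the coarse-grained parallelism studied later in the paper.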


A number of distinct homotopies have been proposed for solving 
polynomial systems. The homotopy map in (11) is from [20]. As 
with all such homotopies, there will be paths diverging to infinity if 
F(x) = 0 has solutions at infinity. These divergent paths are (at least) 
a nuisance, since they require arbitrary stopping criteria. Solutions 
at infinity can be avoided via the following projective transformation. 


Define $F'(y)$ to be the homogenization of $F(x)$:

$F'_j(y) = y_{n+1}^{d_j} F_j(y_1/y_{n+1}, \ldots, y_n/y_{n+1})$, $j = 1, \ldots, n$. (12)

The set of all lines through the origin in $C^{n+1}$ is called complex projective
n-space, denoted $CP^n$, and is a smooth compact (complex)
n-dimensional manifold. The solutions of $F'(y) = 0$ in $CP^n$ are identified
with the (finite) solutions and solutions at infinity of $F(x) = 0$
in the usual way [38]. A basic result on the structure of the solution
set of a polynomial system is the following classical theorem of
Bezout [21]:


Theorem. There are no more than d isolated solutions to $F'(y) = 0$
in $CP^n$. If $F'(y) = 0$ has only a finite number of solutions in $CP^n$,
it has exactly d solutions, counting multiplicities.
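
A small worked example, constructed here for illustration and not taken from the paper's test problems, shows how homogenization exposes a solution at infinity and how the count agrees with Bezout's theorem:

```latex
\begin{align*}
F_1(x) &= x_1 x_2 - 1, \quad d_1 = 2, \\
F_2(x) &= x_1 - 2, \quad d_2 = 1, \qquad d = d_1 d_2 = 2, \\
F'_1(y) &= y_3^{2}\, F_1(y_1/y_3,\, y_2/y_3) = y_1 y_2 - y_3^{2}, \\
F'_2(y) &= y_3\, F_2(y_1/y_3,\, y_2/y_3) = y_1 - 2 y_3 .
\end{align*}
```

With $y_3 = 1$ the homogenized system gives $y_1 = 2$, $y_2 = 1/2$, the single finite solution $x = (2, 1/2)$. With $y_3 = 0$ it gives $y_1 = 0$ and $y_2$ arbitrary, the single projective point $[0 : 1 : 0]$, a solution at infinity. Together these are exactly d = 2 solutions in $CP^2$.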


Recall that a solution is isolated if there is a neighborhood containing 
that solution and no other solution. The multiplicity of an isolated 
solution is defined to be the number of solutions that appear in the 
isolating neighborhood under an arbitrarily small random perturba- 
tion of the system coefficients. If the solution is nonsingular (i.e., the 
system Jacobian matrix is nonsingular at the solution), then it has 
multiplicity one. Otherwise it has multiplicity greater than one. 
Define a linear function

$u(y_1, \ldots, y_{n+1}) = \xi_1 y_1 + \xi_2 y_2 + \cdots + \xi_{n+1} y_{n+1}$,

where $\xi_1, \ldots, \xi_{n+1}$ are nonzero complex numbers, and define
$F'': C^{n+1} \to C^{n+1}$ by

$F''_j(y) = F'_j(y)$, $j = 1, \ldots, n$,
$F''_{n+1}(y) = u(y) - 1$. (13)

So $F''(y) = 0$ is a system of n + 1 equations in n + 1 unknowns,
referred to as the projective transformation of $F(x) = 0$. Since $u(y)$
is linear, it is easy in practice to replace $F''(y) = 0$ by an equivalent
system of n equations in n unknowns. The significance of $F''(y)$ is
given by


Theorem [19]. If $F'(y) = 0$ has only a finite number of solutions in
$CP^n$, then $F''(y) = 0$ has exactly d solutions (counting multiplicities)
in $C^{n+1}$ and no solutions at infinity, for almost all $\xi \in C^{n+1}$.


Under the hypothesis of the theorem, all the solutions of $F'(y) = 0$
can be obtained as lines through the solutions to $F''(y) = 0$. Thus
all the solutions to $F(x) = 0$ can be obtained easily from the solutions
to $F''(y) = 0$, which lie on bounded homotopy paths (since $F''(y) = 0$
has no solutions at infinity).

The import of the above theory is that the nature of the zero
curves of the projective transformation $F''(y)$ of $F(x)$ is as follows:
There are exactly d (the total degree of F) zero curves, which are
monotone in $\lambda$ and have finite arc length. The homotopy algorithm
is to track these d curves, which contain all isolated (transformed)
zeros of F.


4. Computational results. 


There are two extreme approaches for parallelizing the homo- 
topy algorithm. For the coarsest form of parallelism, each individual 
processor tracks as many solutions (paths) as possible until all the 
solutions for the system of polynomial equations are found. In the 
other extreme, where the granularity is the finest, the primary task of 
tracking the solutions is delegated to one of the processors and only 
during polynomial system evaluations, Jacobian matrix evaluations, 
and other numerical calculations is the work distributed among the 
processors. 

In the first case, the division of work is at the highest level. Ini- 
tially, the parameters defining the system of polynomial equations 
(F(x) = 0) are distributed to the processors. Each processor then 
works independently of the other processors. When a processor com- 
pletes the tracking of a path, it takes the next path to be tracked. 
Since there is no knowledge about the paths, they are assigned on a 
first-come, first-served basis. Thus, if a “bad” path exists (one which
requires a large number of function evaluations with respect to the 
remaining paths to be tracked), the load would not be distributed 
evenly among the processors. The processor with the “bad” path 
would cause the other processors to remain idle after all the remain- 
ing paths have been tracked. 

The second approach is an attempt to balance the load more
evenly, in the hope of achieving an overall speedup over the coarser
grained algorithm. In this paper, the fine-grained parallel homotopy
algorithm distributes the work of evaluating the system of polynomial
equations and its partial derivatives to N processors, where N
is the number of equations or the maximum number of processors,
whichever is smaller.

Two parallel versions, a coarse-grained version and a fine-grained
version, of the homotopy algorithm for polynomial systems were developed
from the parallelization of HOMPACK. They were executed
on several parallel machines: Balance 21000, Elxsi 6400, Alliant 
FX/8, and the Intel iPSC-32. The execution times are shown in 
Table 1 of [38] along with those for the execution of the serial al- 
gorithm. The efficiencies are listed in Table 2 of [38]. The total 
degree of a problem refers to the number of solutions in the system 
of polynomial equations. Thus, if a machine has infinitely many pro- 
cessors, the maximum number of processors which can be utilized by 
the coarse-grained algorithm is determined by the total degree of the 
problem. 

This paper is primarily concerned with the results obtained on 
bus-oriented, shared-memory parallel machines. The primary dif- 
ference between implementations of parallel homotopy algorithms 
on shared-memory machines and implementations on distributed- 
memory machines is that, in distributed-memory machines, the mas- 
ter processor must communicate the tasks to the slave processors 
individually and wait for the results through some form of commu- 
nication medium. After a set of results is received from one of the 
slave processors, the master processor assigns the next task to that 
free processor. Both the master processor and the slave processors 
must handle the processing of the communication protocols. 

In shared-memory machines, the task of communicating the 
problem and obtaining the results is much simpler. It is handled 
through shared-memory, which is accessed by all the processors. The 
primary processor sets up the problem in the shared-memory, ini- 
tiates the processors, and handles portions of the problem like the 
other spawned processors. Since the coordination of the processors 
is done through mutual exclusion of certain critical shared memory 
locations, no master processor is needed. 
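
The shared-memory coordination described above can be sketched with a lock-protected counter. This is an illustration in Python threads, not the code run on the Balance, Elxsi, or Alliant machines; `track_path` stands in for the path tracker and all names are hypothetical. Python threads show the coordination pattern only and are not expected to give a speedup for numerical work.

```python
import threading


def track_paths_shared(num_paths, num_workers, track_path):
    """Workers repeatedly claim the next untracked path from a shared counter."""
    next_path = 0                       # shared index of the next path to track
    results = [None] * num_paths        # shared storage for the solutions
    lock = threading.Lock()             # protects the critical section on next_path

    def worker():
        nonlocal next_path
        while True:
            with lock:                  # mutual exclusion: claim exactly one path
                if next_path >= num_paths:
                    return
                k = next_path
                next_path += 1
            results[k] = track_path(k)  # the expensive tracking runs outside the lock

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in "tracker" that just squares the path index.
print(track_paths_shared(12, 4, lambda k: k * k))
```

No thread plays the role of a master: every worker, including the one that set up the problem, executes the same claim-and-track loop.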

The calculations for the coarse-grained efficiencies are based on
the maximum number of processors used on each of the parallel ma- 
chines. The number of processors used on Sequent’s Balance 21000, 
Elxsi’s 6400, and Alliant’s FX/8 are 8, 10, and 8, respectively, except 
for problems where the total number of paths to be tracked is less 
than the number of processors. 

The efficiencies for the coarse-grained algorithm with various
numbers of processors on Alliant’s FX/8 are shown in Table 3 of [38].
As one would expect, the efficiency improves as the number of pro- 
cessors decreases. 


5. Conclusions. 


When the efficiencies of the two algorithms are compared using 
an equal number of processors, the coarse-grained algorithm is found 
to be more efficient than the fine-grained algorithm. In terms of 
speedup, the coarse-grained algorithm outperforms the fine-grained 
algorithm. 

The problem with the coarse-grained algorithm is that when 
some solutions have long paths with respect to other solutions, the 
efficiency can be very low. This depends on the order in which the 
solutions are found and the total number of solutions with respect to 
the number of processors. Figure 1 demonstrates that by changing 
the order in which the solutions (of problem 602 in [38]) are found, the 
distribution of the work load can either be very unbalanced or very 
well balanced. In Figure 1, curve 1 shows that the work load can be 
distributed quite evenly among the processors when the “long” paths 
are assigned first. The distribution of the work load in the actual 
assignment of the paths is shown by curve 2. The worst distribution 
(curve 3) occurs when the “longest” path is assigned last while the 
rest of the paths are assigned in decreasing order in terms of the num- 
ber of function evaluations. From these distributions, curve 1 has the 
“best” efficiency (0.99) and curve 3 has the “worst” efficiency (0.56), 
whereas the actual efficiency was 0.67. 
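
The effect of path ordering on efficiency can be reproduced with a small simulation. The sketch below assumes that each path goes to whichever processor becomes free first and that efficiency is total work divided by (number of processors times the finishing time of the last processor). The path costs are hypothetical and are not the data of problem 602.

```python
import heapq


def efficiency(costs, num_procs):
    """Simulate first-come, first-served assignment of paths in the given order."""
    finish = [0.0] * num_procs          # time at which each processor becomes free
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)        # next processor to become idle
        heapq.heappush(finish, earliest + cost)
    makespan = max(finish)                      # time when the last processor stops
    return sum(costs) / (num_procs * makespan)


costs = [8, 4, 4, 2, 2, 2, 1, 1]                # hypothetical function-evaluation counts
long_first = sorted(costs, reverse=True)        # "long" paths assigned first
long_last = long_first[1:] + long_first[:1]     # longest path assigned last
print(efficiency(long_first, 4))   # 0.75
print(efficiency(long_last, 4))    # 0.5
```

Since path lengths are not known in advance, the favorable ordering cannot simply be chosen; the simulation only shows how much is at stake.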

In general, the coarse-grained parallel homotopy algorithm is
more efficient and permits a higher degree of parallelism than the fine-grained
algorithm. However, since the number of function evaluations


Figure 1. The distribution of work in coarse-grained parallelism for
various orderings of paths for problem 602 on the Alliant FX/8.


Abstract This paper proposes a new systolic architecture and a
systolic algorithm for fast matrix multiplication. The computation
time is 5m/2 - 1 on the systolic array of $m^2$ processing elements. No
other known systolic algorithm can reach this time complexity
by using a systolic array of $m^2$ processing elements.

To demonstrate the power of these methods, we apply them to
solve the shortest path problem using a partition approach. The
proposed block algorithm for the shortest path problem overcomes
difficulties in the management of a large-size graph.


1. Introduction 

With the advance of VLSI technology, a new technological en- 
vironment is now available for manufacturing computers with high 
computation speed. It is now possible to implement a logic circuit 
consisting of hundreds of thousands of components and to build a 
large scale computing network with many inexpensive processing 
elements to perform parallel computation. 

Among the several approaches to parallel organization that
can take advantage of the new technologies, systolic arrays have
great potential to achieve high concurrency and parallel computation.
Systolic arrays [H.T.Kung 80, 82, S.Y.Kung 82, 87] have
been developed as efficient architectures for the solution of regular,
computationally intensive problems.

A systolic array is an array of individual processing elements 
each of which is locally connected with its nearest neighbours to 
perform the same basic operations and to distribute the signal and 
the data across the entire processor array in a highly parallel and 
pipelined fashion. The simplicity of the processors and the unifor- 
mity of the processor interconnection allow the large systolic arrays 
to be implemented effectively on VLSI chips. 

Matrix multiplication is one of the most important operations 
in diverse fields of computer sciences such as signal processing, 
image processing, graph theory and linear algebra. 

In this paper, we propose a new systolic array and a systolic 
algorithm for computing matrix multiplication. 


2. An Efficient VLSI 2-D Systolic Array 


2.1 Description of Systolic Array Architecture 


The proposed systolic array contains $m^2$ processing elements (PEs).
Each PE can perform basic arithmetic and logic operations. There
are two communication lines for each PE. The diagonal line supports
two-way routing, while the vertical line supports only top-down
transmission. The diagonal lines of boundary PEs are connected
to the boundary PEs of the next row/column on the opposite
side. There are several input/output buffers connected to
the boundary PEs. More specifically, we have five m-by-1 array
buffers, named $tM^a$, $tM^w$, $fM^b$, $rM^b$, and $bM^w$. Each of the first
four array buffers holds a row of the input matrices; the bottom
array buffer, $bM^w$, will hold the output. The connections among
PEs and between array buffers and boundary PEs are depicted in
Figure 1.

The following rules specify formally the interconnection of PEs 
to PEs and PEs to array buffers or vice versa. 


Interconnection Rules for Architecture Design
(1) PEs to PEs
$R_1(PE_{i,j}) = PE_{(i+1) \bmod m,\ j}$
$R_2(PE_{i,j}) = PE_{(i+1) \bmod m,\ (j+1) \bmod m}$
$R_3(PE_{i,j}) = PE_{(i-1) \bmod m,\ (j-1) \bmod m}$
(2) PEs to array buffers and vice versa
$R(tM^a_i) = R(tM^w_i) = PE_{1,i}$
$R(fM^b_i) = PE_{(i+1) \bmod m,\ 1}$
$R(rM^b_i) = PE_{i,m}$
$R(bM^w_i) = PE_{m,\ (i+1) \bmod m}$

2.2 A New Systolic Array Algorithm 

A new systolic algorithm for computing the product matrix W
of two m-by-m matrices A and B is described below. The product
matrix W is defined by the formula $w_{xz} = \sum_{y=1}^{m} a_{xy} * b_{yz}$, where
$w_{xz} \in W$, $a_{xy} \in A$, and $b_{yz} \in B$. The following notations will be used
to describe the algorithm:

(1) $w_{xz}^{0} = 0$ if initial, and the previous result otherwise
(2) $w_{xz}^{y} = w_{xz}^{y-1} + (a_{xy} * b_{yz})$
(3) $w_{xz} = w_{xz}^{m}$

We assume that the input matrices in the memory modules
can be accessed one row at a time. The order in which the rows of each
input matrix are loaded from the memory modules is specified
as follows. Matrices A and W are loaded from the low-numbered
row to the high-numbered row, starting with loading the first rows
of A and W to $tM^a$ and $tM^w$ respectively. Matrix B is loaded in
a slightly different way: two rows of B are loaded simultaneously,
starting with loading the (m/2)th and the (m/2 + 1)th rows to
$fM^b$ and $rM^b$ respectively, then the (m/2 - 1)th and the (m/2 +
2)th rows at the next computing cycle, and so on until the first and
the last rows are reached.

At the first step of the algorithm, elements of matrix B are
piped from the column array buffers, $rM^b$ and $fM^b$, into the 2-D
systolic array as shown in Figure 2(a). Then the contents of $PE_{1,m}$,
$PE_{2,m}$, ..., $PE_{m,m}$ are $b_{(m/2)+1,m}$, ..., $b_{(m/2)+1,2}$, $b_{(m/2)+1,1}$
respectively, and the contents of $PE_{1,1}$, $PE_{2,1}$, ..., $PE_{m,1}$ are $b_{m/2,1}$,
$b_{m/2,m}$, ..., $b_{m/2,2}$ respectively. Within m/2 steps, all elements
of matrix B can be piped into the 2-D systolic array as shown in
Figure 2(b).

At the ((m/2)+1)th step, the following computations are executed
in $PE_{1,1}$, $PE_{1,2}$, ..., $PE_{1,m}$ respectively, where $w_{11}$, $w_{12}$, ...,
$w_{1m}$ and $a_{11}$, $a_{12}$, ..., $a_{1m}$ are piped from the upper buffers into the first
row of the array.

$w_{11}^{1} = w_{11}^{0} + a_{11} * b_{11}$,
$w_{12}^{1} = w_{12}^{0} + a_{12} * b_{22}$,
...
$w_{1m}^{1} = w_{1m}^{0} + a_{1m} * b_{mm}$

Then $w_{xz}$ is piped into the lower&right neighbor PE,
$b_{yz}$ is stored in the RAM of each PE to be used repeatedly,
and $a_{xy}$ is piped into the lower neighbor PE as shown in Figure
2(c). After m-1 more steps, all elements of matrices A and W will
have been piped into the systolic array.

During each computation cycle, all the PEs in the array perform
the same multiply-and-add operation. Each PE takes the
value from its upper&left neighbor and adds it to $a_{xy} * b_{yz}$, and
then passes the new value to its lower&right neighbor for use at
the next step. At each computation cycle, the data stream of $a_{xy}$
is piped one row down and $w_{xz}$ is piped one position lower&right
while $b_{yz}$ is waiting in the original PE. The result $w_{xz}$ will be piped
into the bottom array buffers during the last m steps of the algorithm
from the bottom boundary PEs.


ALGORITHM 1 ( Matrix Multiplication Problem )


Procedure Multiply( var : A, B, W ; m-by-m matrices ) ;
begin
  r <- 1 ;
  while ( r <= m/2 ) do   (* Assume that m is even *)
  begin                   (* for 1 <= i <= m do simultaneously *)
    if j <= m/2 in PE(i,j)
      then shift b(y,z) left
      else shift b(y,z) right ;
    r <- r + 1 ;
  end ;
  store b(y,z) ;          (* store b(y,z) into the RAM of each PE *)
  move a(x,y) ;           (* move a(x,y) from tM^a into PE(1,j) *)
  move w(x,z) ;           (* move w(x,z) from tM^w into PE(1,j) *)
  repeat
  begin                   (* all PEs that have all their data available *)
    mult a(x,y), b(y,z), w(x,z) ;
    move a(x,y) ;         (* into the lower neighbor PE *)
    move w(x,z) ;         (* into the lower&right neighbor PE *)
  end ;
  until there are no more data entering the PE ;
end.


Next, we prove that Algorithm 1 works correctly by
the following lemma and theorem:


(Lemma 1): After the first m/2 steps of Algorithm
1, $PE_{i,j}$ will hold $b_{j,(j-i+1) \bmod m}$, for all $1 \le i, j \le m$.
( Proof )

(i) $b_{ij}$ of the front memory modules at the first step will be
loaded into $PE_{m-j+2,1}$ for all $1 \le j \le m$. Then $b_{ij}$ is moved
lower&right during the remaining (i-1) steps. Therefore $b_{ij}$ will halt at

$PE_{((m-j+2)+(i-1)) \bmod m,\ 1+(i-1)} = PE_{(i-j+1) \bmod m,\ i}$.

(ii) $b_{ij}$ of the rear memory modules at the first step will be
loaded into $PE_{m-j+1,m}$ for all $1 \le j \le m$. Then $b_{ij}$ is moved upper&left
during the remaining m-i steps. Therefore $b_{ij}$ will halt at

$PE_{(m-j+1-(m-i)) \bmod m,\ m-(m-i)} = PE_{(i-j+1) \bmod m,\ i}$.

By (i) and (ii), $PE_{i,j}$ holds $b_{j,(j-i+1) \bmod m}$. □


(Theorem 1): Algorithm 1 generates the product matrix
W of matrices A and B.


From the algorithm, we know the following facts:
(a) $w_{ij}$ passes through m PEs from top down.
(b) The m PEs passed by $w_{ij}$ are
$PE_{1,j}$, $PE_{2,(j+1) \bmod m}$, $PE_{3,(j+2) \bmod m}$, ..., $PE_{m,(j+m-1) \bmod m}$.
(c) The A values of the PEs in (b) while $w_{ij}$ passes are
$a_{i,j}$, $a_{i,(j+1) \bmod m}$, $a_{i,(j+2) \bmod m}$, ..., $a_{i,(j+m-1) \bmod m}$.
(d) By Lemma 1, the B values of the PEs in (b) are
$b_{j,j}$, $b_{(j+1) \bmod m,\ j}$, $b_{(j+2) \bmod m,\ j}$, ..., $b_{(j+m-1) \bmod m,\ j}$.
From (c) and (d), we know that the value of $w_{ij}$ is

$\sum_{k=0}^{m-1} a_{i,(j+k) \bmod m} \, b_{(j+k) \bmod m,\ j}$.

Therefore the algorithm generates the product of matrices A
and B. □
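
The data flow used in the proof can be checked with a short simulation. The sketch below models only where the operands meet, not the cycle-by-cycle pipelining or the loading of B. It uses 0-based indices, so the placement of Lemma 1 becomes "PE(r, c) stores b[c][(c - r) mod m]"; the function name is hypothetical.

```python
def systolic_multiply(A, B):
    """Follow each w(i,j) diagonally through the array and accumulate a*b products."""
    m = len(A)
    # Placement after the loading phase (Lemma 1, 0-based): PE(r, c) holds b[c][(c - r) % m].
    stored_b = [[B[c][(c - r) % m] for c in range(m)] for r in range(m)]
    W = [[0] * m for _ in range(m)]
    for i in range(m):                  # row i of A and W enters at the top row
        for j in range(m):
            w = 0
            col = j                     # w(i,j) enters the top row at column j
            for r in range(m):          # one PE per row, moving lower&right
                # Row i of A moves straight down, so this PE sees a[i][col].
                w += A[i][col] * stored_b[r][col]
                col = (col + 1) % m     # wrap around at the array boundary
            W[i][j] = w
    return W


print(systolic_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

At row r the value travelling with w(i,j) meets a[i][(j + r) mod m] and b[(j + r) mod m][j], which is exactly the sum in the theorem.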


2.3 Performance Analysis and Comparison with Other 
Arrays 


In the proposed algorithm 1, a PE, once activated, performs the 
multiply-and-add operation at every computation cycle, until there 
are no more data entering the PE. At the peak of computation, 
about 100 percent of the m? PEs compute simultaneously. Average 
utilization of PEs is about 50 percent. This rate is higher than in 
other systolic arrays. 


The computation time is 5m/2 - 1 time units assuming that each
computing-and-routing cycle takes one unit of time. This can be
verified by the following facts:

(1) In m/2 time units, all elements of matrix B are piped into
the array,

(2) The time difference between the bottom and the top row
elements of either A or W is (m-1) units due to startup time delay
as shown in Figure 2, and

(3) The bottom row elements of either A or W are piped out
of the array after m computing cycles.

Summing the three contributions gives m/2 + (m-1) + m = 5m/2 - 1.


It can be seen easily that our systolic algorithm has better performance
than the other systolic algorithms in terms of the $AT^2$ measure
for VLSI implementation, as shown in Table 1. For more efficient VLSI
implementation, the operation of each PE is controlled locally and
handled asynchronously [Peng & Jun 87, 88]. The basic idea is that
a PE does not have to wait until the previous PE completes its
computation. The approach reduces the computation time by
allowing each individual PE to operate independently, which reduces the
waiting time. The architectural design of the PE will be discussed in
the next section.


Table 1: Comparison of Systolic Arrays

| Systolic Array | Time Complexity | Control System |
| [McCanny 86]   | 4m - 1          | synchronous    |
| [S.Y.Kung 87]  | 4m - 2          | asynchronous   |
| [this paper]   | 5m/2 - 1        | asynchronous   |


2.4 Architectural Design of Processing Element 

In this section we will discuss the organization of the PE to
be used as a basic module in our 2-D systolic array. In order to
minimize the impact of both internal and external processor communication,
each PE is partitioned into an Arithmetic Unit (AU), three
network control I/O units (NCIOUs) which can operate asynchronously, an
EPROM (Erasable Programmable Read Only Memory), a RAM
(Random Access Memory), and a control unit which can control
the execution of both the memory unit and the arithmetic unit.
The organization of the PE is illustrated in Figure 3(a).

Each PE is connected to its three neighbors via indicated busses 
in the control unit. This device can be used to control its speed, 
internal RAM, I/O pins, and byte manipulation. The data in the 
PE move over a multiplexed n'-bit data/address bus.

Computation is accomplished in the Arithmetic Unit, see Figure
3(b). This unit is capable of handling several number formats (fixed
point or floating point). Many operations are available,
such as addition, subtraction, multiplication, division, square root,
trigonometric functions, etc.

The EPROM is used to store algorithms which perform data 
manipulation. This gives the system an efficient implementation so 
that the system can operate asynchronously and with local control. 
The RAM can be used for data storage. This memory space is 
important for partitioned matrix operations. 

The RAM can also be used for storage of programs during 
algorithm development. Once the algorithms have been completed, 
they will be put into EPROM. 

The control unit is partitioned into eight units as shown in
Figure 3(d). They control the RAM, EPROM, AU, and I/O units in
each PE, manage I/O communication with the neighboring PEs in the
network, and allow messages to be sent to or received from several
PEs simultaneously. Each network control I/O unit (NCIOU) can
determine whether the data are needed by the AU and/or other PEs
with no intervention from the AU. Figure 3(c) shows the basic
architecture of one NCIOU. This unit is divided into four
functional units.


3. Applications for the new 2-D systolic Array 


3.1 Mapping a Partitioned Graph into a Systolic Array 

In this section, we describe a partition approach for large size 
directed graphs. The partition method is the key to extend the 
computational capability of VLSI architectures and to overcome 
difficulties in the mapping of a large-size graph into a fixed systolic 
array. 

A directed graph G with n nodes can be represented as a pair G
= (V, E), where V is the set of vertices and E is the set of
weighted edges in the graph. As shown in Figure 4(a) and (b), we
use an adjacency matrix W to represent the directed graph with
six nodes as follows.

(1) $w_{ij}$ = the weight from node i to node j if there is an
edge from i to j;

(2) $w_{ij} = \infty$ if there is no edge connecting i and j;

(3) $w_{ij} = 0$ if i = j.

Suppose that the size of the systolic array is $m^2$. The adjacency
matrix W of size $n^2$ can then be partitioned into $k^2$ submatrices,
each of size $m^2$, where k = n/m. We can do computation on these
submatrices separately using the proposed systolic array of size
$m^2$ and then combine the results to get the solution for the
original large size problem.
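
The partitioning step can be sketched in a few lines of Python. This is only an illustration under the assumption that m divides n exactly; the function name `partition` and the list-of-lists layout are hypothetical and not part of the original design.

```python
def partition(W, m):
    """Split an n-by-n matrix W (list of lists) into k*k blocks of size m-by-m.

    Assumes m divides n; blocks[r][c] is the submatrix W_rc.
    """
    n = len(W)
    if n % m != 0:
        raise ValueError("m must divide n")
    k = n // m
    return [
        [[row[c * m:(c + 1) * m] for row in W[r * m:(r + 1) * m]]
         for c in range(k)]
        for r in range(k)
    ]


# A 6-by-6 matrix split for a 2-by-2 array gives k = 3, i.e. 9 blocks.
W = [[6 * i + j for j in range(6)] for i in range(6)]
blocks = partition(W, 2)
```

Each `blocks[r][c]` is then small enough to be loaded onto the fixed-size array.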

The partition approach can avoid the unrealistic assumption 
that an unlimited number of PEs can be used to execute required 
computation. The size of the systolic system should be fixed once 
it is built. 


3.2 Shortest Path Problem 

In the all-pairs shortest path problem, we are required to
produce a matrix $W^{+}$ of size $n^2$ such that $w^{+}_{ij}$ is the
weight of the shortest path from $v_i$ to $v_j$ in G.

We use Floyd's approach for this problem. Let $w^{k}_{ij}$ denote
the length of the shortest path from $v_i$ to $v_j$ that has
intermediate vertices with indexes smaller than or equal to k.
Then $w^{n}_{ij}$ will be the shortest path we need. Since there are
no negative weight cycles in G, we can easily modify the
multiplication algorithm 1 to find $W^{+}$ by replacing "+" and "x"
in the multiplication algorithm 1 with "min" and "+" as follows:

$$w^{k}_{ij} = \min\{\, w^{k-1}_{ij},\; w^{k-1}_{ik} + w^{k-1}_{kj} \,\}$$
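
As a serial reference for the recurrence above, the following Python sketch applies Floyd's update with "min" and "+" in place of "+" and "x". The function name and the use of `float("inf")` for a missing edge are choices made for this illustration only; it says nothing about how the systolic array schedules the work.

```python
INF = float("inf")


def all_pairs_shortest_paths(W):
    """Floyd's recurrence: allow vertex k as an intermediate, one k at a time."""
    n = len(W)
    D = [row[:] for row in W]          # work on a copy of the adjacency matrix
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # "min" replaces "+", and "+" replaces "x" of matrix multiplication
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D


W = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
D = all_pairs_shortest_paths(W)
```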


Following this modified version of algorithm 1 and the Floyd 
algorithm, the all-pairs shortest path problem can be solved here in 
5/2n - 1 time on the 2-D systolic array with n? PEs. It is illustrated 
by partitioning an example of a graph of order 6 using a systolic 
array of size 2-by-2 as shown in Figure 5(a) and (b). Here the ratio
k is n/m = 3.

The sequences of submatrix computations for the i-th iteration,
where $1 \le i \le 3$, of this algorithm are shown in Figure 5(c). At
the end of the i-th iteration, $w_{xy}$ is updated to be the shortest
path through intermediate nodes contained in submatrix $W_{ii}$. This
is done by comparing $w_{xy}$ with $w_{xi} + w_{iy}$. The above processing
is done through the three steps (1), (2), and (3) of the algorithm as

ALGORITHM 2 (Shortest Weight Path)


Procedure allweight (var W: adjacencymatrix; k: integer);
var i, j, g, h: integer;
begin
    for i := 1 to k do
(1)     W_ii := min( W_ii, W_ii + W_ii ) ;
(2)     for j := 1 to k with j ≠ i do
            W_ij := min( W_ij, W_ii + W_ij ) ;
            W_ji := min( W_ji, W_ji + W_ii ) ;
        end-for ;
(3)     for h := 1 to k with h ≠ i do
            for g := 1 to k with g ≠ i do
                W_hg := min( W_hg, W_hi + W_ig ) ;
            end-for ;
        end-for ;
    end-for ;
end ;


depicted in Figure 5(c). After k iterations, every $w_{xy}$ will have
the correct value of the shortest path from x to y.
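
The block structure of Algorithm 2 can be imitated serially in Python. This is a hedged sketch, not the systolic implementation: the helper names are hypothetical, the min-plus product `min_plus` stands in for one pass through the array, and step (1) is carried out here by running Floyd's recurrence inside the diagonal block (an assumption about what the array computes for W_ii).

```python
INF = float("inf")


def min_plus(A, B):
    """Min-plus product: C[x][y] = min over t of A[x][t] + B[t][y]."""
    return [[min(a + b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def emin(A, B):
    """Element-wise minimum of two equally sized blocks."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def closure(A):
    """Floyd's recurrence restricted to one block (assumed form of step (1))."""
    A = [row[:] for row in A]
    for t in range(len(A)):
        for x in range(len(A)):
            for y in range(len(A)):
                A[x][y] = min(A[x][y], A[x][t] + A[t][y])
    return A


def get(W, m, r, c):
    return [row[c * m:(c + 1) * m] for row in W[r * m:(r + 1) * m]]


def put(W, m, r, c, B):
    for x in range(m):
        W[r * m + x][c * m:(c + 1) * m] = B[x]


def allweight(W, m):
    """Blocked all-pairs shortest paths following steps (1)-(3) of Algorithm 2."""
    W = [row[:] for row in W]
    k = len(W) // m                     # assumes m divides n
    for i in range(k):
        Wii = closure(get(W, m, i, i))                       # step (1)
        put(W, m, i, i, Wii)
        for j in range(k):                                   # step (2)
            if j != i:
                put(W, m, i, j, emin(get(W, m, i, j), min_plus(Wii, get(W, m, i, j))))
                put(W, m, j, i, emin(get(W, m, j, i), min_plus(get(W, m, j, i), Wii)))
        for h in range(k):                                   # step (3)
            for g in range(k):
                if h != i and g != i:
                    prod = min_plus(get(W, m, h, i), get(W, m, i, g))
                    put(W, m, h, g, emin(get(W, m, h, g), prod))
    return W
```

For a graph of order 6 and m = 2 this performs k = 3 iterations of k^2 = 9 block computations each, matching the count used in the complexity analysis.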

The correctness of this algorithm follows easily from the Floyd
algorithm. We now analyze the time complexity. For each i, the
number of submatrix computations is $1 + 2(k-1) + (k-1)^2 = k^2$,
and each submatrix computation takes 3m - 1 units of time.
Therefore, over the k iterations the total computing time is
$(3m-1) \cdot k^2 \cdot k \approx 3n^3/m^2$ using a systolic array of
size $m^2$, since k = n/m. All the computations at steps (2) or (3)
in the algorithm can be done independently. Therefore, the time
complexity can be further improved if there are more systolic
arrays of size $m^2$ available.


4. Conclusion 

In this paper, we describe a new systolic system for matrix
multiplication and related problems. The proposed systolic array
achieves better performance than the previously reported systolic
arrays. Other advantages of our array include: (1) no need for
global propagating control signals, (2) no need for data
pre-arrangement, (3) no need for data interleaving to achieve this
optimal rate, and (4) only two I/O lines required.

For applications, we mention only the shortest path problem for
directed graphs. There should be many other applications for which
the proposed system can be used. Image processing and signal
processing are two possible examples. More research needs to be
done to explore the potential of the new systolic system.
Abstract — Parallel algorithms for multiplying large integers
are considered in the EREW PRAM model. A parallel addition
algorithm is discussed, followed by a shift-and-add multiplication
algorithm which runs in O(log n) time with $O(n^2)$ processors.
Two FFT-based algorithms are discussed: one in the field of
complex numbers, which runs in O(log n log log n) time, and one in
the ring of integers modulo $2^n+1$ (Schönhage-Strassen), which runs
in O(log n) time.


1. Introduction 


Many problems in Applied Algebra and Number Theory require the 
exact product of two very large n-bit integers, where n is too large 
for an ordinary machine multiply instruction. We assume no bound 
on n. We consider only the multiplication of positive integers, since 
the product of negative integers can be computed from their positive 
counterparts with a simple test for final sign. We discuss four paral- 
lel algorithms — one for addition, one for shift-and-add multiplica- 
tion, and two for multiplication using FFT’s. Detailed presentations 
of the algorithms discussed herein can be found in [15]. 


1.1. Parallel Processing Model 


There are many parallel processing models available [7, 9, 12, 16, 
18, 20, 24], each of which has advantages and disadvantages. We 
work within the context of a global shared memory model (some- 
times called the PRAM [24] model), where we have p processing 
elements (PE’s) connected to a global memory. The PE’s constitute 
an SIMD machine. Each PE is assumed to have local memory of 
size O(logn). We selected this model because it poses no architec- 
tural restrictions or delays on access to atomic data, and is conse- 
quently ideal for asymptotic time analysis. 


Three modes of the global shared memory model have been 
identified by [20]. These affect how PE’s may access the global 
memory. They are — CRCW (concurrent read, concurrent write), 
CREW (concurrent read, exclusive write), and EREW (exclusive 
read, exclusive write). The EREW mode is the most restrictive of 
the three [12]. We chose to use the EREW PRAM model, since
CRCW or CREW could prevent us from easily transferring algorithms
to a model or architecture which does not permit concurrent
access.


2. Parallel Addition 


We present an algorithm, based on ideas primarily from [11] and
also from [7, 17, 18, 23], which adds two n-bit binary integers r and
s, where $n = 2^N$. If we think of r + s as a modulo $2^n$ addition
with a carry into the n-th position, we can write this sum as two
modulo $2^{n/2}$ sums of the upper and lower halves of r and s with
carry into the n-th position and carry between these two sums (i.e.
into the (n/2)-th position). Applying this recursively log n times, we
reach the "bottom" where the sums are modulo 2 and we have a carry
(modulo 2) into each bit position. r + s can be computed from the
bit-wise exclusive-or of $r_i$, $s_i$, and these carries.


To compute the carries, we start at the "bottom" point and compute
the carry out of the modulo 2 sums with a boolean operation. We
then unwind the recursion to the "top" (modulo $2^n$ sum), generating
"carry-out" data as we go. We then wind the recursion down again,
distributing the carry-out information as "carry-in". This gives us
the carry into each of the modulo 2 sums $r_i \oplus s_i$. The carry-in
to the lowest sum is forced to be 0. This recursion requires
O(log n) time with n PE's. We refer to this as ALGORITHM A.


This algorithm can compute r - s by inverting s and forcing the
lowest order carry-in to be 1.


Catonsville, Maryland 
BEGIN ALGORITHM A


1.0 Let r and s be two n-bit numbers where $n = 2^N$, $N > 0$. Let
    $c_{i,j}$, $f_{i,j}$, and $z_{i,j}$ represent the j-th value of c, f,
    and z at the i-th level of recursion. c, f, and z are used in
    accumulation and distribution of the carry information.
    Assignments grouped under one range (the large right braces of
    the original layout) are computed in parallel over that range,
    which is written to the right. We compute the (n+1)-bit sum
    t = r + s thus:

1.1 Initialize.
        $c_{0,i} \leftarrow r_i \wedge s_i$
        $f_{0,i} \leftarrow r_i \oplus s_i$                              $0 \le i < n$

1.2 Compute
        FOR k ← 1 TO N
            $c_{k,i} \leftarrow c_{k-1,2i+1} \vee (f_{k-1,2i+1} \wedge c_{k-1,2i})$
            $f_{k,i} \leftarrow f_{k-1,2i+1} \wedge f_{k-1,2i}$           $0 \le i < 2^{N-k}$
        ENDFOR

1.3     $t_n \leftarrow c_{N,0}$
        $z_{N,0} \leftarrow 0$

1.4 Compute
        FOR k ← N DOWNTO 1
            $z_{k-1,2i} \leftarrow z_{k,i}$
            $z_{k-1,2i+1} \leftarrow c_{k-1,2i} \vee (f_{k-1,2i} \wedge z_{k,i})$   $0 \le i < 2^{N-k}$
        ENDFOR

1.5     $t_i \leftarrow f_{0,i} \oplus z_{0,i}$                          $0 \le i < n$


END ALGORITHM A
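
The up-sweep and down-sweep of ALGORITHM A can be checked with a serial Python simulation. The sketch below assumes that the initialization is c = r AND s and f = r XOR s, as the surrounding text suggests; each list comprehension or inner loop stands for one parallel step, and the helper names `to_bits` and `from_bits` are introduced only for this example.

```python
def to_bits(x, n):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def algorithm_a(r, s):
    """Add two n-bit numbers (n = 2**N, bit lists); returns n+1 sum bits."""
    n = len(r)
    N = n.bit_length() - 1
    assert n == 1 << N and len(s) == n
    # 1.1 carry-generate (c) and carry-propagate (f) for single bits
    c = [[a & b for a, b in zip(r, s)]]
    f = [[a ^ b for a, b in zip(r, s)]]
    # 1.2 up-sweep: combine adjacent halves, level by level
    for k in range(1, N + 1):
        pc, pf = c[k - 1], f[k - 1]
        width = 1 << (N - k)
        c.append([pc[2 * i + 1] | (pf[2 * i + 1] & pc[2 * i]) for i in range(width)])
        f.append([pf[2 * i + 1] & pf[2 * i] for i in range(width)])
    # 1.3 carry out of the whole sum; carry into the lowest position is 0
    t_n = c[N][0]
    z = [0]
    # 1.4 down-sweep: distribute carry-in to both halves of each group
    for k in range(N, 0, -1):
        nz = [0] * (1 << (N - k + 1))
        for i in range(1 << (N - k)):
            nz[2 * i] = z[i]
            nz[2 * i + 1] = c[k - 1][2 * i] | (f[k - 1][2 * i] & z[i])
        z = nz
    # 1.5 sum bits
    return [f[0][i] ^ z[i] for i in range(n)] + [t_n]
```

With N levels in each sweep and constant work per element per level, the simulated parallel depth is O(log n), in line with the bound stated for ALGORITHM A.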


3. Multiplication by Direct Computation 


The ‘‘natural’’ method of multiplication we are taught as children 
consists of multiplying the multiplicand by the individual digits of 
the multiplier, shifting each of these results to the left as the order of 
the multiplier digit increases, and summing the shifted results to 
form the final product. Binary multiplication is a special case of this 
method where the values summed are simply repeated copies of the 
multiplicand shifted to the left i places wherever the i-th bit of the 
multiplier is 1. 


With $O(n^2)$ PE's, we can distribute copies of the multiplicand over
an $O(n^2)$-sized memory in O(log n) time. We can also distribute
the multiplier in O(log n) time, and logical-and the multiplier and
multiplicand bits (to remove the undesired rows) in constant time.
This reduces the problem to that of summing n rows of (2n)-bit
numbers. If two rows can be added in constant time, we can sum
them up in O(log n) time using a typical binary-tree summation.
Clearly we cannot use ALGORITHM A for this, since each addition
in the tree would itself require O(log n) time.


3.1. Carry-Save Addition 


There is a valuable addition technique known as a Carry-Save 
Adder [17], which we abbreviate CSA. A CSA can add three n -bit 
numbers giving two (n+1)-bit numbers in time independent of n. 
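Specifically, if a, b, c, and d are n-bit binary numbers, and e is an
(n+1)-bit number, we can compute $d + e \leftarrow a + b + c$ by computing

$$d_i \leftarrow a_i \oplus b_i \oplus c_i ,$$
$$e_{i+1} \leftarrow (b_i \wedge c_i) \vee (a_i \wedge (b_i \oplus c_i)) ,$$
$$e_0 \leftarrow 0 .$$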


So, if we have n rows to sum, we can reduce this to 2n/3 rows in
constant time with $O(n^2)$ PE's. Since each reduction is by a factor
of 2/3, it takes $\log_{3/2} n = O(\log n)$ applications of this technique
to reduce the number of rows to two, where they can be added in
O(log n) time using ALGORITHM A. Therefore, we can multiply
two numbers in O(log n) time with $O(n^2)$ PE's. We refer to this as
ALGORITHM B.
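
The row reduction behind ALGORITHM B can be sketched serially in Python, using arbitrary-precision integers as bit vectors so that one bitwise operation plays the role of one parallel step over all bit positions. The function names are hypothetical, and the final `sum` of the last two rows merely stands in for an application of ALGORITHM A.

```python
def csa(a, b, c):
    """Carry-save add: returns (d, e) with d + e == a + b + c."""
    d = a ^ b ^ c                              # bitwise sum without carries
    e = ((b & c) | (a & (b ^ c))) << 1         # carries, moved up one position
    return d, e


def multiply_b(u, v):
    """Shift-and-add multiplication with carry-save row reduction."""
    # One row per 1-bit of the multiplier: the multiplicand shifted left i places.
    rows = [u << i for i in range(v.bit_length()) if (v >> i) & 1]
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        reduced = []
        for i in range(0, full, 3):            # every 3 rows become 2 rows
            reduced.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        reduced.extend(rows[full:])            # 0, 1, or 2 leftover rows
        rows = reduced
    return sum(rows)                           # final carry-propagating addition
```

Each pass of the `while` loop shrinks the row count by roughly a factor of 2/3, which is where the logarithmic number of reduction rounds comes from.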


4. FFT Techniques 


Owing to the relationship between the Fourier Transform domain 
and convolution, the Fast Fourier Transform (FFT) can be used to 
multiply finite polynomials, and hence integers, very rapidly. The 
use of the FFT for multiplication of finite polynomials has been well 
studied. Presentations can be found in [1, 5, 10, 16, 19] among
others. The FFT was apparently first applied to this problem by
Strassen in 1968 [10].


For computation of exact convolution results, there are two common
families of FFT algorithms. One family uses the ring of integers
modulo $2^{\alpha}+1$ for some $\alpha$ (the Schönhage-Strassen algorithm)
[1, 10, 19] and selects $\alpha$ large enough that the correct result is
obtained. The other family uses the field of complex numbers with
sufficient floating point precision to guarantee the result [10, 19].
The FFT in parallel has been discussed by [8, 16, 19, 21, 22] among
others.


4.1. FFT’s in the Complex Field 


The FFT has its origins [5, 6, 14] in the field of complex numbers.
The nth root of unity is $\omega = e^{2\pi i/n}$ where the transform is
over a vector of n values. Strassen's approach, as presented in
[10, 19], works this way: Let u and v be n-bit numbers. Let

$$2n \le 2^k l < 4n, \qquad K = 2^k, \qquad L = 2^l .$$

We view u and v as K-place base-L integers, whose upper K/2
digits are 0. We perform k stage FFT's on u and v yielding their
transforms $[\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{K-1}]$ and
$[\hat{v}_0, \hat{v}_1, \ldots, \hat{v}_{K-1}]$, compute the piecewise
products $[\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_{K-1}]$ of the two
transforms, and compute the inverse FFT $[w_0, w_1, \ldots, w_{K-1}]$ of
those products.

Since the $w_i$'s are the convolution of K-place base-L numbers, each
$w_i$ is upperbounded by $KL^2$ and we cannot simply concatenate these
results to form the answer. The desired result can be obtained by
computing

$$w = \sum_{i=0}^{K-1} w_i L^i .$$


We now examine the algorithm in detail. Select k and l such that
k = l. k and l are both slightly smaller than log n. This makes for
an awkward exact form for K, k, and l, but they are approximately
k = l = O(log n) and K = O(n/log n).
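
The overall flow (split into base-L digits, transform, multiply piecewise, invert, recombine) can be illustrated with a small serial Python sketch. It uses ordinary double-precision complex numbers from the standard library instead of the m-bit fixed point fractions analyzed in this section, so it is only trustworthy for modest operand sizes; the digit width `l = 8` and all function names are assumptions made for the illustration.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT over the complex field (len(a) a power of two)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = -1.0 if inverse else 1.0
    out = [0j] * n
    for t in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * t / n)   # power of the n-th root of unity
        out[t] = even[t] + w * odd[t]                 # one butterfly
        out[t + n // 2] = even[t] - w * odd[t]
    return out


def fft_multiply(u, v, l=8):
    """Multiply non-negative integers via base-L digits, L = 2**l."""
    L = 1 << l

    def digits(x):
        d = []
        while x:
            d.append(x & (L - 1))
            x >>= l
        return d

    du, dv = digits(u), digits(v)
    K = 1
    while K < 2 * max(len(du), len(dv), 1):           # upper K/2 digits stay 0
        K *= 2
    du += [0] * (K - len(du))
    dv += [0] * (K - len(dv))
    U, V = fft(du), fft(dv)
    W = fft([x * y for x, y in zip(U, V)], inverse=True)
    w = [round(c.real / K) for c in W]                # inverse transform divides by K
    return sum(wi << (l * i) for i, wi in enumerate(w))   # w = sum of w_i * L**i
```

The final line performs the recombination $w = \sum w_i L^i$; the rounding step is where insufficient precision would show up as a wrong digit.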


All numbers are converted to binary fixed point fractions so that
they will have a magnitude less than 1. Assuming we use m-bit
precision, the initial error in any number is $2^{-m}$. Each complex
operation has one of the two forms $a \leftarrow b\omega + c$ or
$a \leftarrow b\omega$. Either of these forms will cause the loss of two
bits of accuracy from both the real and imaginary components.


We begin both forward FFT's by shifting the base-L integers to the
right (l+k) places. Shifting l places makes each number a fraction.
Shifting k additional places ensures that none of the intermediate
results exceed 1, since no butterfly can do more than double the
intermediate values. Since the FFT is a linear operation [13],
dividing the inputs by $2^{k+l}$ reduces the output values by the same
factor. So the piecewise products have the form
$\hat{w}_i / 2^{2k+2l}$, and the final output of the inverse FFT,

$$\left[ \frac{w_0}{2^{2k+2l}}, \frac{w_1}{2^{2k+2l}}, \ldots, \frac{w_{K-1}}{2^{2k+2l}} \right],$$

must be multiplied by $2^{2k+2l}$ to reconstruct the desired answer.
Therefore the error must not propagate any higher than
$2^{-2k-2l-1} = 2^{-4k-1}$. The k stages of the forward FFT's induce 2k
bits of error, so the error after the forward FFT's is $2^{2k-m}$. The
complex multiplication of the $\hat{u}$'s and $\hat{v}$'s induces an
additional two bits of error, raising the error to $2^{2k-m+2}$. The
inverse FFT induces another 2k bits of error, but the division by K
reduces half of that, so after the inverse FFT the propagated error
is $2^{3k-m+2}$. Thus m must be at least 7k+3 to limit the error to
$2^{-4k-1}$. Since k < log n, we simplify and use the figure 7 log n as
the number of bits required to obtain the correct result.


Addition and subtraction of O(log n)-bit numbers can be done using
ALGORITHM A in O(log log n) time without any significant
alteration. The scaling in the FFT algorithm prevents additions from
overflowing into a number greater than 1. Multiplication can be
done using ALGORITHM B in O(log log n) time with the alteration
that we shift rows to the right (for fractional accumulation) instead
of to the left.


In each butterfly in each stage of each FFT, we must perform one
complex multiplication and two complex additions. These can be
done in O(log log n) time. With k = O(log n) stages, an entire FFT
requires O(log n log log n) time.


The multiplication of the piecewise products takes O(log log n)
time. The inverse FFT is the same as the forward FFT. The
division by K and multiplication by $2^{2k+2l}$ at the end of the inverse
FFT are simply shifts and are insignificant. Once the inverse FFT
has been computed and properly shifted, we need to compute the final
result

$$w = \sum_{i=0}^{K-1} w_i L^i .$$

Since each $w_i$ is no more than k + 2l = 3 log L bits long, we can
rearrange them into three numbers, and use one carry-save addition
and one application of ALGORITHM A to compute the sum. The entire
convolution requires O(log n log log n) time.

We now address the computation of the sinusoid table. Each entry
in the table must be accurate to 7 log n bits. We compute the table
by computing $\omega = e^{2\pi i/K}$ using the conventional series for
$e^x$, then computing the powers of $\omega$ from it. There are
K = O(n/log n) powers of $\omega$ in the table. Each computation of a
power of $\omega$ is a complex multiply which can be done in
O(log log n) time. O(log n) such computations are required, so the
table can be computed and copied in O(log n log log n) time. Each
$\omega^j$ we compute is the complex product of no more than k other
$\omega^i$ values, so only 2k bits of error are induced. If we compute
$\omega$ with 9 log n correct bits, we will have 7 log n correct bits in
each power of $\omega$. We can compute $\omega$ to this accuracy in
O(log n log log n) time [2, 3, 4, 15].


This FFT algorithm multiplies two n-bit integers in
O(log n log log n) time with O(n log n log log n) PE's. We refer to
this as ALGORITHM C.


4.2. FFT's in the Ring of Integers


The classic form of the Ring-of-Integers approach is the well-known 
Schénhage-Strassen algorithm. This algorithm is relatively complex, 
and [1, 19] give good presentations of it. The FFT is computed in 
the ring of integers modulo some number whose results will be 
smaller than the original operands. These results must then be multi- 
plied together, which is done recursively with successively smaller 
and smaller FFT’s until the numbers are small enough to multiply 
with some temporally faster technique. 


Given two numbers u and v, we represent them as n-bit binary
numbers where $n = 2^k$ and $u_i = v_i = 0$, $n/2 \le i < n$. Let
$b = 2^{\lfloor k/2 \rfloor}$ and let l = n/b. Let u and v be represented as
b (l)-bit numbers such that

$$u = \sum_{i=0}^{b-1} u_i 2^{li} \quad\text{and}\quad v = \sum_{i=0}^{b-1} v_i 2^{li} .$$


We compute the product of u and v modulo $2^n+1$ using FFT's
where each butterfly is computed in the ring of integers
modulo $2^{2l}+1$. b and l are both approximately $\sqrt{n}$, so a single
FFT leaves us with $\sqrt{n}$ multiplications of $(2\sqrt{n})$-bit numbers.
The algorithm is used recursively to perform these multiplications.
The original multiplicands are padded with 0's so the answer
modulo $2^n+1$ is exact. Recursive multiplications do not require
padding since we only need the modular congruent result.


For speed in the butterflies, we use a two-valued representation for
the numbers, and design each FFT butterfly computation so that it
operates on this two-valued representation instead of the normal
single-valued representation. If we have an x-bit number a, which
is modulo $2^x$, we can represent a as the sum of two x-bit numbers
$a'$ and $a''$ (which we call "carry-save notation" (CSN)), such that

$$a \bmod 2^x = (a' + a'') \bmod 2^x .$$


The presentation in [1] uses butterfly computations modulo $2^{2l}+1$.
Performing binary arithmetic on numbers in this ring is awkward. In
[19], Schönhage and Strassen suggest computing in the ring of
integers modulo $2^{4l}$, then converting the number back to
modulo $2^{2l}+1$ when the computations are finished. Performing binary
arithmetic modulo $2^{4l}$ is more convenient than modulo $2^{2l}+1$. This
representation has duplicate values, but also has several noteworthy
advantages, since $2^{4l} \equiv 1$ modulo $2^{2l}+1$ and $2^{2l} \equiv -1$
modulo $2^{2l}+1$.


Multiplication by $2^x$ is a shift left of x places. To maintain
congruence modulo $2^{2l}+1$, since $2^{4l} \equiv 1$, each bit shifted off
the left end of a number is shifted back into the right end. Thus
multiplication by $2^x$ is simply a rotate left of x places. Addition
with a CSA is the same: the carry out of the highest position is
written into the unoccupied lowest order position of the sum. Since
$2^{2l} \equiv -1$ modulo $2^{2l}+1$, subtraction can be done with the formula
$a - b = a + 2^{2l} b$, and is consequently a combination of rotation
and addition.


The CSN number $u = (u' + u'')$ modulo $2^{4l}$ can be converted back to
modulo $2^{2l}+1$ in the following manner [19]: first add $u'$ and $u''$
using ALGORITHM A (except that the carry out must be routed to the
carry in). This converts u from CSN to a single (4l)-bit number.
View u as $u = 2^{2l}\delta + \epsilon$, where $\delta$ and $\epsilon$ are
(2l)-bit numbers. Since $2^{2l} \equiv -1$, we can write
$u = (\epsilon - \delta)$ modulo $2^{2l}+1$, which is computed as given in
equation (1).

With these concepts, we now give a parallel version of the
Schönhage-Strassen multiplication algorithm. The majority of the
presentation is similar to that given in [1], but we have made four
significant changes. In the FFT's, we perform the butterfly
computations modulo $2^{4l}$ instead of modulo $2^{2l}+1$, and represent
the numbers in CSN. At the end of each FFT we resolve the (4l)-bit
CSN form of the results to single modulo $2^{2l}+1$ numbers. The third
change is in the multiplication modulo b. Schönhage and Strassen
make clever use of the Karatsuba $O(n^{\log 3})$ multiplication
algorithm; we use ALGORITHM B instead. The fourth change is the
final reconstruction of the result in the last step. As we did with
the last step of the complex-field FFT, we convert the summands to
three large numbers and add them. The algorithm is shown at the end
of this paper.


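$$u = \begin{cases} \epsilon - \delta & \epsilon \ge \delta \\ 2^{2l} + 1 + \epsilon - \delta & \epsilon < \delta . \end{cases} \qquad (1)$$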


Resolving a modulo $2^{4l}$ number in CSN to a single-valued
modulo $2^{2l}+1$ number using equation (1) requires a constant number
of applications of ALGORITHM A and takes O(log l) time.
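
The two properties used here (multiplication by a power of two as a rotation, and resolution of a CSN pair through equation (1)) can be checked with a short serial Python sketch. Python integers stand in for the (4l)-bit words, the end-around carry is done with ordinary addition rather than ALGORITHM A, and the function names are hypothetical.

```python
def rotate_left(a, x, l):
    """Multiply the (4l)-bit word a by 2**x, staying congruent modulo 2**(2l) + 1."""
    width = 4 * l
    x %= width
    mask = (1 << width) - 1
    return ((a << x) | (a >> (width - x))) & mask     # bits leaving the top re-enter below


def resolve(u1, u2, l):
    """Resolve the CSN pair (u1, u2) to a single value modulo 2**(2l) + 1."""
    width = 4 * l
    mask = (1 << width) - 1
    s = u1 + u2
    u = (s & mask) + (s >> width)                     # carry out routed to carry in
    delta = u >> (2 * l)                              # upper 2l bits
    eps = u & ((1 << (2 * l)) - 1)                    # lower 2l bits
    if eps >= delta:                                  # equation (1)
        return eps - delta
    return (1 << (2 * l)) + 1 + eps - delta
```

For example, with l = 2 the words are 8 bits wide and the modulus is 17; `resolve(200, 100, 2)` then agrees with `(200 + 100) % 17`.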


An FFT requires log b stages, each of which computes b butterflies
modulo $2^{4l}$. Each butterfly computation can be done in constant
time. A complete FFT takes O(log b) time with O(bl) PE's.


Step 2.1 takes O(log b) time with O(bl) PE's.

Step 2.2 takes O(log l) time with O(bl) PE's.

Step 2.3 is a recursive call to this algorithm for b pairs of (2l)-bit
numbers. We will examine this step further in a moment.

Step 2.4 is the same as step 2.1.

Step 2.5 takes O(log l) time with O(bl) PE's.

Step 2.6 takes O(log log b) time with $O(b \log^2 b)$ PE's.

Step 2.7 takes O(log l) time with O(bl) PE's.

Step 2.8 takes O(log n) time with O(n) PE's.


Excluding step 2.3 for the moment, the dominant time required is
O(log n) with O(n) PE's. To include step 2.3, we need to know the
total number of levels of recursion. We observe that 2l decreases
from level to level in the following manner:

$$2\sqrt{n},\quad 2\sqrt{2\sqrt{n}},\quad 2\sqrt{2\sqrt{2\sqrt{n}}},\quad \ldots$$

which we can write as

$$2^{\,2 - 2^{-i}}\; n^{\,2^{-(i+1)}}, \qquad i = 0, 1, 2, 3, \ldots$$

and is approximately $4 n^{2^{-(i+1)}}$. Taking the log, we get
$2^{-(i+1)} \log n + 2$, which is a constant when i = log log n - 1, so
the recursion proceeds through O(log log n) levels. Since the time
required is O(log n) for a single level of recursion on arguments of
size n, the $2^{-(i+1)} \log n$ form for the log of the argument size
gives an overall time requirement of O(log n) with O(n log log n)
PE's.
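
The step from the per-level cost to the overall bound can be written out as a geometric series. As a sketch, assume one level on arguments of size x costs at most $c \log x$ for some constant c (the constant c is introduced here only for illustration). Adding the top level to the log log n recursive levels gives

$$c\log n + \sum_{i=0}^{\log\log n - 1} c\left(2^{-(i+1)}\log n + 2\right) \;\le\; c\log n + c\log n \sum_{i \ge 0} 2^{-(i+1)} + 2c\log\log n \;=\; 2c\log n + 2c\log\log n \;=\; O(\log n).$$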


In [19], Schénhage and Strassen give a depth complexity of 
O (logn) for a logic net which implements their algorithm. We have 
shown that we can achieve the same time in the EREW PRAM 
model by representing the numbers in CSN and using CSA’s in the 
butterfly computations. 


5. Conclusions and Further Research 


The FFT is well-known as a useful tool for serial multiplication of 
large integers, and we have shown that it is quite powerful for paral- 
lel integer multiplication as well. 


Since the original preparation of this paper, more detailed work has 
been performed. Work has been done in three models - a version of 
the EREW PRAM model where each memory cell contains 1 bit and 
each PE can perform 1-bit boolean operations (abbrev. PRAM-1), a 
version of the EREW PRAM model where each memory cell con- 
tains 32 bits and each PE can perform 32-bit boolean and arithmetic 
operations (including multiplication) (abbrev. PRAM-32), and an 
interconnection model where 32-bit PE’s were connected via a barrel 
shifter (abbrev. Bsh). The performance was estimated based on
running time and on cost (running time × number of PE's).


In all models, ALGORITHM C had a consistently higher cost than 
ALGORITHM D. This is due largely to the high constant factor 
associated with the O(log log n) time required by the butterfly
computations in ALGORITHM C. ALGORITHM B was faster than
other algorithms, and for small values of n had a lower cost, but the 
cost grows much more rapidly, crossing ALGORITHM D at about 
n =2’. In the PRAM-32 model, the appearance of a 32-bit addition 
capability had a dramatic effect. ALGORITHM A is no longer 
necessary, since probabilistic arguments weigh heavily in favor of a 
simple ripple-carry addition approach with a small expected running 
time relatively independent of n. A faster version of ALGORITHM 
B was developed which used the 32-bit addition instructions instead 
of carry-save addition. The cost of this crosses the cost of ALGO- 
RITHM D at about n =2!*. The 32-bit multiplications make ALGO- 
RITHM C faster than ALGORITHM D for small values of n, but 
ALGORITHM D is still faster after about n =2!*. 


In the Bsh model, ALGORITHMs C and D require about the same
time. With a fast 32-bit multiplier and communication delays more 
than about four times the instruction time, ALGORITHM C is 
definitely faster than ALGORITHM D. Across the board, however, 
the cost of ALGORITHM D is still consistently less than that of 
ALGORITHM C. This apparent improvement in ALGORITHM C is 
actually the result of the difficulty encountered by ALGORITHM D 
when moved to an interconnection network. Except for the 
butterfly-to-butterfly communication, which is well understood, 
ALGORITHM C only needs to transfer data over a range of
$O(\log^2 n)$ PE's in the floating point multiplications, while
ALGORITHM D needs to transfer data over a range of $O(\sqrt{n})$ PE's
for the modulo $2^{4l}$ rotations. Once communication delays become a
function
of the distance travelled, the time required by ALGORITHM D 
increases. 


There are still a number of open questions, including the specific 
type of FFT to use and the possibility of VLSI implementation. We 
suspect VLSI implementation of ALGORITHM C would be more 
effective since it is simpler and has shorter PE-to-PE communication 
requirements. An investigation into the possible improvements 
which come from an MIMD implementation is also warranted. 


BEGIN ALGORITHM D


Input: The input is two n-bit integers, u and v, where $n = 2^k$. At the topmost level of recursion, the uppermost n/2 bits
of u and v must be 0. This is not required at any subsequent level of recursion.


Output: The output is the (n+1)-bit product of u and v modulo $2^n+1$.


2.0 If n is small, multiply u and v modulo $2^n+1$ using any method. Otherwise let $b = 2^{\lfloor k/2 \rfloor}$ and let l = n/b.
    Express u and v as $u = \sum_{i=0}^{b-1} u_i 2^{li}$ and $v = \sum_{i=0}^{b-1} v_i 2^{li}$, and convert each to CSN modulo $2^{4l}$.

2.1 Compute the FFT's modulo $2^{2l}+1$ (using modulo $2^{4l}$ CSN) of $[u_0, \psi u_1, \psi^2 u_2, \ldots, \psi^{b-1} u_{b-1}]$ and
    $[v_0, \psi v_1, \psi^2 v_2, \ldots, \psi^{b-1} v_{b-1}]$ where $\psi = 2^{2l/b}$ and $\omega = \psi^2$ is a b-th root of unity.

2.2 Resolve the CSN modulo $2^{4l}$ results $[\hat{u}'_0, \ldots, \hat{u}'_{b-1}]$ and $[\hat{v}'_0, \ldots, \hat{v}'_{b-1}]$ to their single-valued
    modulo $2^{2l}+1$ equivalents $[\hat{u}_0, \ldots, \hat{u}_{b-1}]$ and $[\hat{v}_0, \ldots, \hat{v}_{b-1}]$ using equation (1).

2.3 Compute the pairwise modulo $2^{2l}+1$ products $[(\hat{u}_0 \times \hat{v}_0)$ modulo $2^{2l}+1, \ldots, (\hat{u}_{b-1} \times \hat{v}_{b-1})$ modulo $2^{2l}+1]$
    of the two FFT's by recursive use of ALGORITHM D.

2.4 Compute the inverse FFT modulo $2^{2l}+1$ of the pairwise products from step 2.3. The result is
    $[w_0, \psi w_1, \ldots, \psi^{b-1} w_{b-1}]$ where each $\psi^i w_i$ product is modulo $2^{4l}$ and is in CSN.

2.5 Compute the single-valued numbers $w''_i = w_i$ modulo $2^{2l}+1$ by first computing
    $\hat{w}_i = (\psi^i w_i \times \psi^{-i})$ modulo $2^{4l}$, then computing $w''_i = \hat{w}_i$ modulo $2^{2l}+1$ using equation (1).

2.6 Compute $w'_i = w_i$ modulo b by computing $w'_i \leftarrow ((u_i$ modulo $b) \times (v_i$ modulo $b))$ modulo b using
    ALGORITHM B.

2.7 Compute the exact $w_i$ values by computing $w_i = (2^{2l}+1)((w'_i - w''_i)$ modulo $b) + w''_i$, where each $w_i$ is positive
    and no more than $b\,2^{2l}$.

2.8 Construct three n-bit numbers with non-overlapping sequences from $\sum_{i=0}^{b-1} w_i 2^{li}$ modulo $2^n+1$ and add them
    modulo $2^n+1$. This is the desired result.

END ALGORITHM D
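
The arithmetic of steps 2.0 through 2.8 can be traced with a serial Python sketch. It departs from ALGORITHM D in several labeled ways, all of them assumptions made for illustration: ordinary integers replace CSN, direct transform sums replace the butterfly network, the pairwise products of step 2.3 use Python multiplication instead of a recursive call, step 2.6 forms the wrapped convolution of the digits modulo b, and a sign correction is added in step 2.7 so that negative coefficients of the wrapped convolution are handled. The function name is hypothetical.

```python
def negacyclic_product(u, v, k):
    """Return (u * v) mod (2**n + 1), n = 2**k, for 0 <= u, v < 2**n."""
    n = 1 << k
    if k < 4:                                        # 2.0: "n is small"
        return (u * v) % ((1 << n) + 1)
    b = 1 << (k // 2)
    l = n // b
    M = (1 << (2 * l)) + 1                           # transform modulus 2**(2l) + 1
    psi = 1 << (2 * l // b)                          # psi**b = -1 (mod M)
    omega = psi * psi % M                            # b-th root of unity
    psi_inv = pow(psi, 2 * b - 1, M)                 # since psi**(2b) = 1 (mod M)
    omega_inv = psi_inv * psi_inv % M
    b_inv = pow(b, -1, M)                            # needs Python 3.8+
    mask = (1 << l) - 1
    us = [(u >> (l * i)) & mask for i in range(b)]   # l-bit digits u_i
    vs = [(v >> (l * i)) & mask for i in range(b)]

    def dft(x, root):
        return [sum(x[i] * pow(root, i * j, M) for i in range(b)) % M
                for j in range(b)]

    # 2.1-2.2: transforms of the psi-weighted digit vectors
    U = dft([us[i] * pow(psi, i, M) % M for i in range(b)], omega)
    V = dft([vs[i] * pow(psi, i, M) % M for i in range(b)], omega)
    # 2.3: pairwise products (stand-in for the recursive call)
    P = [U[j] * V[j] % M for j in range(b)]
    # 2.4-2.5: inverse transform, then remove the psi**i weights
    Y = [y * b_inv % M for y in dft(P, omega_inv)]
    w2 = [Y[i] * pow(psi_inv, i, M) % M for i in range(b)]      # w_i mod M
    # 2.6: w_i mod b from the wrapped convolution of the digits mod b
    w1 = []
    for i in range(b):
        s = 0
        for j in range(b):
            t = (us[j] % b) * (vs[(i - j) % b] % b)
            s += t if j <= i else -t                 # wrapped terms enter negatively
        w1.append(s % b)
    # 2.7-2.8: exact coefficients, then recombination modulo 2**n + 1
    total = 0
    for i in range(b):
        w = M * ((w1[i] - w2[i]) % b) + w2[i]
        if w >= (i + 1) << (2 * l):                  # added correction: w_i was negative
            w -= b * M
        total += w << (l * i)
    return total % ((1 << n) + 1)
```

When the upper n/2 bits of both inputs are 0, as required at the topmost level, the result is the exact product, since it is smaller than $2^n$.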
Abstract -- During L/U decomposition of a sparse matrix, it is
possible to perform computation on many diagonal elements simultaneously.
Pivots that can be processed in parallel are related by a compatibility 
relation and are grouped in a compatible set. The collection of all maxi- 
mal compatibles yields different maximum sized sets of pivots that can 
be processed in parallel. Generation of the maximal compatibles is based 
on the construction of an incompatible table which gives information 
about pairs of incompatible pivots. The algorithm to generate all maxi- 
mal compatibles involves a binary tree search and is exponential in the 
order of the matrix. A technique to obtain an ordered compatible set 
directly from the ordered incompatible table is given. The ordering is 
based on the Markowitz number of the pivot candidates. This technique 
generates a set of compatible pivots with the property of generating few 
fills. A new heuristic algorithm is presented that combines the idea of an 
ordered compatible set with a limited binary tree search to generate 
several sets of compatible pivots in linear time. An elimination set to
reduce the matrix is generated and selected on the basis of a minimum
Markowitz sum number. The parallel pivoting technique is a stepwise 
algorithm and can be applied to any submatrix of the original matrix. 
Thus it is not a preordering of the sparse matrix and is applied dynami- 
cally as the decomposition proceeds. Parameters are suggested to obtain 
a balance between parallelism and fill-ins. A sample result of applying 
the proposed algorithms on an application matrix using the HEP mul- 
tiprocessor is presented. 

1. Introduction 


In this paper we present a parallel triangularization algorithm for 
solving large, sparse systems of linear equations. The algorithms 
described are designed for a shared-memory MIMD model for parallel 
computation, in which the total memory address space is accessible uni- 
formly to all parallel units. This computational model provides syn- 
chronization mechanisms to allow multiple memory updates. If multiple 
updates are aimed at the same memory cell, the penalty paid is a short 
delay in access time. 


The triangulation of an n×n matrix $A = [a_{ij}]$ can be described by
the following procedure.

for k = 1, 2, ..., n-1 and for each $a_{ik} \ne 0$

    $a_{ik} \leftarrow a_{ik} / a_{kk}$                      i > k          (1.1)

    For each pair $a_{ik} \cdot a_{kj} \ne 0$
    $a_{ij} \leftarrow a_{ij} - a_{ik} \times a_{kj}$          i > k, j > k   (1.2)

In (1.2) if $a_{ij} = 0$ but $a_{ik} \cdot a_{kj} \ne 0$, a fill-in is generated.
If we have sufficient processors, the divide operations (1.1) for each
column k can be done in parallel. Also, for each k the update operation
(1.2) for all pairs $a_{ik} \cdot a_{kj} \ne 0$ can be done in parallel. Our
experience in employing this approach has indicated that the sparsity of
application matrices leaves parallel processes with little work to
perform if only reduction for a single pivot is done in parallel [1].
During sparse LU decomposition it is possible to perform computation on
many diagonal elements simultaneously. In parallel LU decomposition of
general unsymmetric sparse matrices several key issues must be
considered:


a)  Parallelism and fill-in are two competing issues and a balance
    between the two must be obtained. In other words minimizing
    fill-ins results in limited parallelism, and maximizing parallelism
    results in uncontrolled generation of fill-ins.

b)  A test for numerical stability of pivots must be made to ensure the
    accuracy of the solution process.


+Research was supported in part by NASA Contract No. NAS1-17070 and by the Air 
Force Office of Scientific Research under Grant No. AFOSR 85-1089 while the author was 
in residence at ICASE, NASA Langley Research Center, Hampton, VA 23665. 
c) In applications where the sparse linear system must be solved 
repeatedly, it must be possible to decompose structurally identical 
matrices using the information produced for the first decomposition 
of such a matrix. 

d) A storage structure suitable for parallel processing must be determined. 
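
The elimination steps (1.1) and (1.2), and the fill-in they can generate, may be illustrated with a small sequential sketch. It assumes a dense list-of-lists storage, a nonzero diagonal and no pivoting; none of these choices is taken from the paper, and the function name is hypothetical.

```python
def eliminate(a):
    """In-place elimination following (1.1) and (1.2).

    `a` is a square matrix stored as a list of rows (an assumption of this
    sketch). Returns the number of fill-ins created.
    """
    n = len(a)
    fill_ins = 0
    for k in range(n - 1):
        for i in range(k + 1, n):
            if a[i][k] == 0:
                continue
            a[i][k] /= a[k][k]                  # divide step (1.1)
            for j in range(k + 1, n):
                if a[k][j] == 0:
                    continue
                if a[i][j] == 0:
                    fill_ins += 1               # a zero entry becomes nonzero
                a[i][j] -= a[i][k] * a[k][j]    # update step (1.2)
    return fill_ins
```

For an arrow-shaped matrix with a full first row and column, eliminating the dense pivot first creates fill-ins, while the reversed ordering creates none, which is the kind of effect the pivot selection discussed later tries to control.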


In the remainder of this paper we present the design of a heuristic 
algorithm [2], [3], [4] which identifies parallel pivot candidates and 
allows the matrix to be reduced for multiple pivots simultaneously while 
it minimizes fill-ins. Other parallel pivoting strategies have been 
suggested [5], [6], [7], [8], [9], [10], [11], [12]. The proposed algorithm is a 
dynamic algorithm which can be applied at any point during the decom- 
position phase and does not require a preordering of the input matrix. 
Therefore pivots can be tested for numerical stability and unsymmetric 
permutations are possible between consecutive applications of the paral- 
lel pivoting algorithm to the sparse matrix under consideration. This 
technique also allows the decomposition of structurally identical matrices 
to be carried out as required in (c). 


A description of the parallel pivoting algorithm is given in section 
2. Section 3 gives an analysis of the order of the algorithm. Section 4 
describes several program-controlled parameters to control generation of 
fill-ins. Finally in section 5 some results and concluding remarks are 
given. 
2. Parallel Pivoting Algorithm 


In the technique described here parallel pivot candidates are 
obtained from the diagonal elements of the matrix, thus allowing only 
symmetric permutations during the application of the parallel pivoting 
algorithm. However, since it is a stepwise algorithm, it is possible to 
perform unsymmetric permutations in between the applications of the 
parallel pivoting procedure. 


Pivots that can be processed in parallel are related by a compatibility 
relation and are grouped in a compatible. In other words, pivots 
$p_i$, $p_j$, $p_k$ are compatible and can be processed in parallel if and only 
if elements $a_{ij}$, $a_{ji}$, $a_{ik}$, $a_{ki}$, $a_{jk}$, $a_{kj}$ are all zero. A compatibility relation 
classifies the elements of a set into nondisjoint subsets, so that all 
members of a subset are compatible. Thus the collection of all maximal 
compatibles [13] yields different maximum-sized sets of pivots that can 
be processed in parallel. Several methods for generating maximal 
compatibles exist, and they are all based on the construction of an implication 
(incompatible) table. The incompatible table gives information about 
pairs of incompatible pivots. Let the incompatible table be represented 
by an array, imptbl, of dimension n, the order of the matrix, with elements 
of imptbl being sets of n elements each. Each set corresponds to a 
column of the table. If we assume the diagonal elements of the matrix 
are numbered 1 through n, then column i of the table, $imptbl_i$, holds the 
incompatible information for pivot i of the matrix. Figure 2.1.b shows 
the incompatible table for the matrix of Figure 2.1.a in the given order. 
Column one of the table is constructed by scanning row and column 1 of 
the matrix. For each nonzero encountered, the corresponding position in 
$imptbl_1$ is marked. Therefore pivot $p_1$ is incompatible with pivots $p_7$ and 
$p_8$. Next, ignoring row/column 1 of the matrix, $imptbl_2$ is constructed by 
scanning row/column 2. The process is repeated until all (n-1) columns 
of the table are complete. 
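
The construction of the incompatible table can be sketched as follows. The boolean pattern representation, the 0-based pivot numbering and the function name are assumptions made for illustration; the text numbers pivots from 1 and does not prescribe a data layout.

```python
def build_imptbl(pattern):
    """Build the incompatible table from a nonzero pattern.

    pattern[i][j] is truthy where a_ij is nonzero (0-based indices).
    Returns a list of sets: imptbl[i] holds the pivots j > i that are
    incompatible with pivot i, i.e. a_ij or a_ji is nonzero.
    """
    n = len(pattern)
    imptbl = [set() for _ in range(n)]
    for i in range(n):
        # Rows and columns before i were handled by earlier table columns.
        for j in range(i + 1, n):
            if pattern[i][j] or pattern[j][i]:
                imptbl[i].add(j)
    return imptbl


# Hypothetical 4 x 4 pattern: diagonal plus off-diagonal nonzeros at (0,3) and (2,1).
example_pattern = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
```

In this example pivot 0 conflicts with pivot 3 and pivot 1 conflicts with pivot 2, so the last two columns of the table are null.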


A systematic approach for extracting the maximal compatibles 
involves a binary tree search. This approach is exponential in the order 
of the matrix; however, its description is essential in the derivation of the 
target algorithm. Initially, it is assumed that all pivots are compatible. 
They are grouped in one set consisting of all pivots. This set, $S$, will be at 
the root of a binary tree, level zero. Next, the set of pivots incompatible 
with $p_1$, obtained from $imptbl_1$, is used to split $S$ into a left set $S_1$ and a right 
set $S_2$, constituting level one. $S_1$ consists of all elements of its parent $S$ 
except those incompatible with $p_1$. $S_2$ consists of the same elements as 
$S$ except $p_1$ itself. Next, using $imptbl_2$ we split every set at level 1 to 
generate the four sets at level 2. This process is repeated until no more 
splitting of the sets is possible. Figure 2.1.c illustrates this procedure 
for the matrix of Figure 2.1.a. The leaf sets are checked and any set 
included in a larger set is eliminated. The remaining sets constitute all 
possible maximal compatibles. 


Figure 2.1.b Incompatible Table 


Among the maximal compatibles generated, sets b, c, and d are of 
maximum size 4. This suggests we need a mechanism to select, among 
all the candidate sets, the one which would generate the fewest fill-ins. The 
technique used here is based on the Markowitz criterion for minimizing 
fill-ins in sparse matrices in sequential programming. The Markowitz 
number [14] of an element $a_{ij}$ is defined by $(r_i - 1)(c_j - 1)$, where $r_i$ and 
$c_j$ are the number of nonzeros in row i and column j of the reduced 
matrix respectively. At each step, the element with minimum Markowitz 
number is selected as pivot. The strategy used here is called 
Markowitz Sum and is the sum of the Markowitz numbers of all pivots in 
a set. At each step, among all the maximal compatibles of equal maximum 
size we select the one with minimum Markowitz Sum to reduce 
the matrix. Sets b, c, and d have Markowitz Sums of 11, 22, and 22 
respectively. Therefore set b, with compatible pivots 1, 3, 4, 6, is the one 
to be selected to reduce the matrix in parallel. 
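
A minimal sketch of the Markowitz Sum selection rule follows. The function names and the example Markowitz values are illustrative assumptions; the values are chosen only so that the two size-4 candidates have sums 11 and 22, and they are not claimed to be the entries of Figure 2.1.a.

```python
def markowitz_numbers(pattern):
    """Markowitz number (r_i - 1)(c_i - 1) of each diagonal pivot i.

    pattern[i][j] is truthy where a_ij is nonzero in the reduced matrix.
    """
    n = len(pattern)
    row_counts = [sum(1 for v in row if v) for row in pattern]
    col_counts = [sum(1 for i in range(n) if pattern[i][j]) for j in range(n)]
    return [(row_counts[i] - 1) * (col_counts[i] - 1) for i in range(n)]


def select_elimination_set(candidates, markowitz):
    """Among the largest candidate sets, pick the one with minimum Markowitz Sum."""
    max_size = max(len(c) for c in candidates)
    largest = [c for c in candidates if len(c) == max_size]
    return min(largest, key=lambda c: sum(markowitz[p] for p in c))


# Hypothetical Markowitz numbers keyed by pivot number.
example_markowitz = {1: 2, 2: 6, 3: 2, 4: 1, 5: 12, 6: 6}
example_candidates = [(1, 3, 4, 6), (1, 3, 5, 6), (2, 4)]
chosen = select_elimination_set(example_candidates, example_markowitz)
```

Here `chosen` is the set (1, 3, 4, 6), whose Markowitz Sum of 11 is the smaller of the two size-4 candidates.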


The exponential character of this algorithm prohibits its use. 
However, over a number of analyses done for several small test cases 
arising from electronic circuits, we found that many parallel computation 
steps are possible and that during these steps the matrices are often reduced 
completely. Most importantly, our results show that by slightly reducing the 
amount of parallel work at each step it is possible to reduce the 
generation of fill-in significantly and still reduce the matrix in the same 
number of steps [2]. The goal of our heuristic algorithm therefore is to 
obtain enough parallel work by just considering a sub-maximal set of 
compatible pivots at each step. 


By a more careful analysis of the incompatible table we can produce 
a compatible without searching the binary tree. Column i of imptbl 
yields all pivots $p_j$, $j > i$, incompatible with $p_i$. If $imptbl_i$ is null, then $p_i$ 
is compatible with every pivot $p_j$ with $j > i$. By scanning the table, we 
find a set of pivots, compset, whose columns in the table are null. 
Clearly, such pivots are compatible. Furthermore, if the set of 
incompatible pivots of $p_k$ is disjoint from compset, then $p_k$ is also 
compatible with every pivot in compset. The procedure to produce an 
ordered compatible, compset, can be summarized as follows: 


    scan imptbl from right to left 
    for each column i of imptbl do 
    begin 
        if (imptbl_i ∩ compset is empty) then 
            (* add [i] to the set of compatibles *) 
            compset = compset + [i] 
        else 
            delete row i of imptbl 
    end 
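
The right-to-left scan can be sketched in a few lines. The list-of-sets representation and 0-based pivot numbers are assumptions of this sketch. The "delete row i" step of the procedure is not reproduced explicitly: a rejected pivot never enters compset, so it cannot affect any later intersection test.

```python
def ordered_compatible(imptbl):
    """Scan the incompatible table from right to left and build compset.

    imptbl[i] is the set of pivots j > i that are incompatible with pivot i.
    """
    compset = set()
    for i in reversed(range(len(imptbl))):
        # Pivot i joins only if it conflicts with no pivot already chosen.
        if not (imptbl[i] & compset):
            compset.add(i)
    return compset


# Hypothetical table for four pivots: 0 conflicts with 3, and 1 conflicts with 2.
example_imptbl = [{3}, {2}, set(), set()]
example_compset = ordered_compatible(example_imptbl)
```

For this table the two null columns (pivots 2 and 3) are accepted first, and pivots 1 and 0 are then rejected because each conflicts with an accepted pivot.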
Figure 2.1.c: Binary Tree Search to Obtain the Set of Maximal Compatibles 
(root set [1,2,3,4,5,6,7,8]; leaf sets include b = [1,3,4,6] and c = [1,3,5,6]) 


Applying this procedure to the incompatible table of Figure 2.1.b results 
in: 


Null columns = (7,8), Compset = (4,7,8) 
Markowitz Sum = 1 + 12 + 3 = 16 


Pivot 7 is the pivot with maximum Markowitz number in the matrix. We 
would like to have compatible pivots with as low Markowitz numbers as 
possible in order to minimize fill-in. It is clear that pivots with low 
Markowitz numbers generally have fewer incompatibilities. Changing the 
order of construction of the incompatible table can eliminate the 
inclusion of pivots with high Markowitz numbers in the ordered compatible. 
The table is constructed for pivots in decreasing order of Markowitz 
number. This is illustrated for the same matrix of Figure 2.1.a in Figure 
2.1.d, where a larger compset, of size 4 and lower Markowitz Sum, 11, is 
obtained. 


Now we can combine the idea of an ordered compatible with the 
tree search algorithm to obtain a limited tree search algorithm which 
produces an acceptable set of compatible pivots for reducing the matrix. 
This is done by partially searching the binary tree up to a given level, 
ULEVEL, generating a number of sets. Each of these sets is a subset 
of the root and can be considered as a starting set. Thus, by scanning 
the incompatible table corresponding to each of the starting sets at 
ULEVEL, we can construct an ordered compatible. 


Among all ordered compatibles, the one of maximum size and 
minimum Markowitz Sum is chosen as the elimination set to reduce the 
matrix. Different orderings for producing the starting sets at ULEVEL 
have been considered [2], [3]. The most promising one is to split the 
sets with pivots in increasing order of their Markowitz numbers. This 
process seems to give a good balance to the binary tree for the first few 
levels used to generate the starting sets. It also tends to keep 
pivots of low Markowitz numbers within the starting sets. 


Figure 2.1.d: Ordered Incompatible Table 
Markowitz Number: 12 12 6 6 3 2 2 
Null columns: (1,3,4), Compset: (1,3,4,6) 
Markowitz Sum: 2+2+1+6 = 11 


3. Order of Parallel Pivoting Algorithm 


The algorithm is no longer exponential in time. It consists of the 
following steps: 


1. Construct the incompatible table. 
2. Sort pivots according to Markowitz numbers. 
3. Produce starting sets at ULEVEL. 
4. Generate an ordered compatible for each of the starting sets at 
ULEVEL. 


The algorithm involves some set manipulation operations. These 
operations are listed for each section separately as required. 


1. Construction of the incompatible table requires scanning the 
non-zero elements of the matrix. The set operations involved are adding 
an element to a set and test for membership which are both O (1) opera- 
tions. Therefore, the incompatible table can be constructed in O (NZ) 
operations where NZ is the number of nonzeros in the matrix. 


2. A sorting algorithm is needed to order the pivots according to 
their Markowitz numbers. The Batcher sort [15] is used here and it has 
been shown that with enough parallel operations, sorting is completed in 
$\frac{1}{2} \lceil \log n \rceil (\lceil \log n \rceil + 1)$ steps. Employing an efficient sort would improve 
the performance of the algorithm. 
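
As a worked illustration of the step count quoted for the Batcher sort, the following sketch evaluates the expression for a given n. It assumes the logarithm is base 2, which is the usual convention for this bound but is not stated explicitly in the text; the function name is hypothetical.

```python
def batcher_steps(n):
    """Parallel step count (1/2) * ceil(log2 n) * (ceil(log2 n) + 1) for n >= 1."""
    levels = (n - 1).bit_length()      # equals ceil(log2(n)) for n >= 1
    return levels * (levels + 1) // 2
```

For the 144 pivots of the test matrix discussed in section 5, ceil(log2 144) is 8, so the expression evaluates to 36 parallel steps.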


3. Production of all starting sets at ULEVEL takes a constant time, 
say K, proportional to the number of starting sets. It involves the set 
operations intersection, difference, deletion of an element from the set, 
and test for a null set. These operations are of order n with a constant 
factor equal to the inverse of the number of bits per computer word. The 
set operations are usually implemented in machine language or microcode 
and thus have a small time factor. They could be considered to 
have a constant time (rather than order of n) compared to the time taken 
to execute a high level language statement. Therefore, an efficient 
implementation of the set operations is important to the efficient 
execution of the algorithm. If we denote the time to do a set operation by 
setop, then this section can be done in O(K·setop). 
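
The set operations named above map directly onto word-parallel bit operations, which is why their cost is n divided by the word size. The sketch below uses Python integers as bit masks; the 64-bit word size and all function names are assumptions for illustration, not details of the HEP implementation.

```python
WORD_BITS = 64  # assumed machine word size


def to_mask(elements):
    """Encode a set of small non-negative integers as a bit mask."""
    mask = 0
    for e in elements:
        mask |= 1 << e
    return mask


def words_needed(n, word_bits=WORD_BITS):
    """Number of machine words touched by one set operation on n-element sets."""
    return -(-n // word_bits)          # ceil(n / word_bits)


def intersection(a, b):
    return a & b


def difference(a, b):
    return a & ~b


def delete(a, e):
    return a & ~(1 << e)


def is_null(a):
    return a == 0
```

With these assumptions a set over the 144 pivots of the test matrix in section 5 occupies three words, so each operation costs a handful of word instructions.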


4. Generation of an ordered compatible from the incompatible 
table requires scanning n sets corresponding to the columns of the table 
and performing intersection and difference operations on the sets. Thus 
ordered compatibles can be produced in O(K·n·setop), where K is the 
number of starting sets at ULEVEL. For reasonable values of ULEVEL, 
all ordered compatibles can be derived in parallel for different starting 
sets. 


4. Fill-in Minimization 


It is possible to minimize generation of fill-ins significantly by 
reducing the amount of parallel work slightly. Trading off parallelism 
for fill-in reduction is done according to the size of the elimination set 
and a number of parameters: 

1. Shrinkage parameter: By allowing a small percentage of the 
elimination set to be discarded we can control the number of com- 
patible pivots to a degree that does not limit our parallel work by 
too much. 


2. Upper limit parameter: This limit would allow just enough parallel 
work to keep our parallel processes busy. 


3. Threshold parameter: In shrinking the size of an elimination set 
only pivots with Markowitz number higher than a threshold value 
in the ordered list of pivots may be discarded. Pivots with low 
Markowitz numbers do not tend to generate many fills and need 
not be discarded. 
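
One possible reading of how the three parameters could interact is sketched below. The exact semantics are not given in the text, so everything here is an assumption: the shrinkage parameter is taken as the maximum fraction of the set that may be dropped, the threshold as a Markowitz number at or below which a pivot is never dropped, and the upper limit as a cap on the final set size. The function name is hypothetical.

```python
def shrink_elimination_set(pivots, markowitz, shrinkage, upper_limit, threshold):
    """Trim an elimination set under the assumed parameter semantics.

    pivots      -- iterable of pivot identifiers
    markowitz   -- mapping from pivot to its Markowitz number
    shrinkage   -- maximum fraction of pivots that may be discarded (assumed)
    upper_limit -- maximum number of pivots kept (assumed)
    threshold   -- pivots with Markowitz number <= threshold are kept (assumed)
    """
    ordered = sorted(pivots, key=lambda p: markowitz[p])   # cheapest pivots first
    max_discard = int(len(ordered) * shrinkage)
    discarded = 0
    # Drop the costliest pivots first, but never those at or below the threshold.
    while ordered and discarded < max_discard and markowitz[ordered[-1]] > threshold:
        ordered.pop()
        discarded += 1
    return ordered[:upper_limit]
```

Under this reading, a set with Markowitz numbers 1, 2, 2, 6, 12, a shrinkage of 40% and a threshold of 5 loses its two costliest pivots, while a threshold of 10 protects the pivot with Markowitz number 6.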


5. Result and Conclusion 


The above algorithm has been simulated on a VAX 11/780 and 
implemented on the HEP pipelined shared-memory computer [16]. Test 
cases from application programs such as the SPICE circuit simulation 
[17] and SPAR structural analysis program [18] have been used and 
results of different analyses are available [2]. Here we present the timing 
results of running the parallel program over a 144 by 144 matrix 
from the circuit of an 8-bit full adder and employing different values for 
trade off parameters. Figure 5.1 represents the execution time of the parallel 
LU decomposition program for different numbers of processes from 1 
to 25. The result is for the case when maximum parallelism is used. For 
NPROC=1, the matrix is completely reduced in 10 parallel steps. The 
number of compatible pivots at each step is 72, 25, 16, 11, 6, 5, 3, 2, 2, 
and 1 respectively. Note that in the first step half of the matrix is reduced 
in parallel. The execution time decreases with an increase in the number 
of processes up to NPROC=11. In fact there is a 1/NPROC reduction in 
execution time for small values of NPROC as new processes make 
efficient use of the execution pipeline. This decrease in execution time 
bottoms out as the pipeline becomes full (the execution pipeline on the 
HEP has eight steps, resulting in a speed up of about 8). The slope of the 
linearly rising tail of the curve indicates the length of time spent in critical 
sections at various points in the program. A complete model for 
analysis of parallel programs can be found in [19]. We define the speed up 
to be 


$S = T(1) / T(NPROC)$ 


where T(1) is the time to execute the program with one process and 
T(NPROC) is the time using NPROC processes. A speed up 
of 4.82 is then obtained for 11 processes. Note that this is not speed up measured 
with respect to the best sequential algorithm, but only gives insight 
into the parallelism in this program. 


For a small number of processes, execution time versus the number 
NPROC of processes can be represented as: 


$T(NPROC) = C_1 + C_2 / NPROC$ 


where $C_1$ represents the sequential portion of the work and $C_2$ the parallel 
portion. A simple least squares fit to determine $C_1$ and $C_2$ is applied 
to a linear portion of the execution time versus NPROC curve to estimate 
the degree of parallelism. This analysis shows that the code is 87% 
parallel. A speed up of 5.81 was obtained when the following trade off 
parameters were used: 


Threshold    1/3        Shrinkage Parameter    30% 
Upper Limit  25         ULEVEL                 4 
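
The least squares fit of $C_1$ and $C_2$ mentioned above can be sketched as an ordinary linear regression of execution time against 1/NPROC. The function names are hypothetical, and the timing values in the example are synthetic numbers constructed to give a parallel fraction of 87%; they are not the measurements behind Figure 5.1.

```python
def fit_sequential_parallel(nprocs, times):
    """Least-squares fit of T(NPROC) = C1 + C2 / NPROC; returns (C1, C2)."""
    xs = [1.0 / p for p in nprocs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    c2 = sxt / sxx
    c1 = mean_t - c2 * mean_x
    return c1, c2


def parallel_fraction(c1, c2):
    """Share of the one-process time T(1) = C1 + C2 that is parallel work."""
    return c2 / (c1 + c2)


# Synthetic timings generated from C1 = 13 and C2 = 87 (illustrative only).
example_nprocs = [1, 2, 4, 8]
example_times = [13.0 + 87.0 / p for p in example_nprocs]
example_c1, example_c2 = fit_sequential_parallel(example_nprocs, example_times)
```

On the synthetic data the fit recovers the generating constants, and `parallel_fraction` returns 0.87.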
Figure 5.1 Execution Time vs. Number of Processes 
(No Trade off; 144x144, NZ=616, 8-Bit Full Adder) 


The higher speed up indicates that by employing the above parameters a 
better balance between the numbers of compatible pivots generated at 
different steps is achieved. A reduction of 23% in fill is obtained as the 
result of the above parameter variations, which compares reasonably with 
results from the best sequential program. The fill-in can be decreased 
further by assigning different values to the trade off parameters. 


The sparse LU decomposition technique described in this paper 
employs a parallel pivoting strategy to solve the problem of having 
enough parallelism in sparse matrices. The main features of the heuristic 
algorithm can be summarized as follows: 


-It can identify a good set of parallel pivots in linear time. 


-It is a stepwise algorithm and can be applied to any submatrix of 
the original matrix. Thus it is not a preordering of the sparse matrix and 
is applied dynamically as the decomposition proceeds. 


-Pivots can be tested for numerical stability and unsymmetric per- 
mutations can be performed if necessary. 


-Trade off between parallelism and fill-in is possible under several 
program controlled parameters. 


-The information produced by the algorithm can be stored to 
decompose structurally identical matrices. 


We have presented the parallel reduction combined with parallel 
pivoting technique, control over the generation of fills and check for 
numerical stability, all done in parallel with work being distributed over 
the active processes. The program verifies that it is actually possible to 
do parallel pivoting in sparse matrices on multiprocessors and take 
advantage of the existing parallelism in the problem and in the hardware. 
Optimal Decomposition of Matrix Multiplication 
on Multiprocessor Architectures 


Whelan, M., Gao, Guang R.¹ and Yum, T. K.² 
Philips Laboratories, North American Philips Corporation, 
Briarcliff Manor, NY 10510 


1. Introduction. Exploiting parallelism in matrix 
multiplication on a shared memory multiprocessor will 
encounter problems common to many other 
applications: determination of both an appropriate 
grain size for the parallel tasks for each processor, and 
an allocation which keeps the processors usefully busy. 


In this paper, we study the problem of decomposing 
matrix multiplication so that it achieves optimal 
speedup on a multiprocessor. In scientific and 
engineering applications, matrix multiplication is 
frequently used as a kernel benchmark in the evaluation 
of supercomputer performance [4, 5]. Many other linear 
algebra algorithms may be transformed or reduced to 
contain matrix multiplication as their kernel [1]. 


Our approach is based on the observation that matrix 
multiplication has a regular iterative nature and that 
task decompositions based on iterations have fixed 


computation and communication patterns. Therefore, 
an analytical model calculating the cost of such 
computation/communication patterns is possible. 


Compared with other recent related work [2, 7], our 
work computes the effect of computation and 
communication cost in the case of the execution of a 
single matrix multiply as opposed to the overlapped 
execution of many matrix multiplies. From this analysis 
we derive optimal partitioning and allocation 
strategies for matrix multiplication on a shared memory 
multiprocessor. 


In section 2, we describe the model of a 
multiprocessor architecture which is used in the 
analysis. In section 3, we formulate the decomposition 
problem, and derive analytically the optimal condition 
for program partitioning and an optimal allocation 
strategy. In section 4 we present a summary of 
simulation experiments which were performed to 
determine the validity of assumptions made in the 
analysis. In section 5, we derive an optimal speedup 
function. In section 6, we discuss computation/communication 
tradeoffs, and in section 7 we conclude 
with a comparison of our work with other related work. 


¹ Current address: McGill Univ., School of Computer Science, 
805 Sherbrooke St. West, Montreal, Canada H3A 2K6 

² Current address: Nynex Corporation, White Plains, NY 


2. Architecture Model. The main features of the 
machine model our study is based on are: 


1. There are n identical processors (PEs) connected by a 
single bus. 

2. The bus is a packet oriented data transfer medium, 
as opposed to a backplane interconnect. 

3. The memory system is organized in two levels. At 
the first level, each PE has its private local memory 
(PM) and a cache (CM); and at the second level, 
there is a common shared memory (SM). 


The time to transfer a block of L words from SM to a 
CM is $T_L = T_o + L \times T_w$, where $T_o$ and $T_w$ are 
respectively the packet overhead and the transfer time 
per element of data after start-up. Define 
$T_c = (T_o + L \times T_w)/L$, i.e., $T_c$ is the average time to 
transfer a data element. Each PE can compute 
multiply accumulate operations in the context of a 
matrix multiplication at the rate of one per $T_x$ time 
units. It must be noted that this cost must include the 
average cost of the conditional tests, loop counters, and 
address arithmetic which must be performed on a real 
machine. 


In determining the load a task imposes on the shared 
bus, we do not account for the following items. 


1. Fetching code from the shared memory. 
2. Cache organization, e.g. replacement strategy. 
3. Finite cache sizes. 


Our rationale is that item 1 is minimal since the code 
required to specify matrix multiplication is very small. 
Items 2 and 3 are justified on the assumption that 
even for relatively small caches, the frequency with 
which cache collisions occur is very small; thus once a 
variable has been accessed by a processor, it will 
thereafter with very high probability reside in that 
processor's cache. It is our contention that within 
certain constraints, the analysis will still accurately 
predict the properties of a realistic machine. The 
results of simulations, which do account for these 
effects, verify this assumption. 


3. Optimal Decomposition. The impact of 
decomposition strategy on the performance of matrix 
multiplication can be demonstrated by considering the 
following three decompositions for $C = A \times B$ (where A 
and B are $M \times M$ matrices): (1) uniprocessor; (2) M 
processors with each computing one row of C; (3) $M^2$ 
processors each computing one element of C. 


Case    $T_{cp}$        $T_{cm}$              $O(T_{tot})$ 

1.      $M^3 T_x$       $2M^2 T_c$            $M^3 T_x$ 

2.      $M^2 T_x$       $(1+M)M^2 T_c$        $M^3 T_c$ 

3.      $M T_x$         $2M^3 T_c$            $2M^3 T_c$ 


Figure 1: Three simple partitioning cases. 


The first two columns of the table in Figure 1 show the 
computation time per processor $T_{cp}$ and communication 
time per processor $T_{cm}$ required for each of the three 
cases. The third column of the table lists the dominant 
term of the total time. We assume that the matrices A 
and B are stored in shared memory and their elements 
will be transferred to processors as required by the 
computation. We may immediately observe that 
between cases 2 and 3 the total time doubled although 
M times as many processors are used. 


For simplicity, we study the multiplication of two 
$M \times M$ square matrices A and B, although the extension 
to arbitrary matrix multiplication is relatively 
straightforward [8]. In this paper, the cost metric being 
used is total elapsed time. Thus, optimizing 
performance means minimizing the total time required 
to complete a single matrix multiplication. It should be 
noted that this is not the cost function used in other 
related work [2], where throughput is used. 
Throughput is an appropriate cost measure if one has 
many such computations to be performed and the 
computations may be overlapped. However, if one does 
not have many such computations, or if the 
computations cannot be overlapped (e.g., because of 
dependencies), then elapsed time is more appropriate. 


The resultant matrix C is partitioned into rectangular 
submatrices to be computed by each PE using the basic 
block matrix multiplication algorithm [6]. The two 
aspects of decomposition strategy are closely related to 
each other, and must be considered together in order to 
determine the shape and size of the submatrices as the 
optimal condition for decomposition. The computation 
and communication of each task should be arranged so 
that optimal performance may be achieved. 


It is natural to speculate that a square shape would be 
an optimal choice for the submatrices. This is indeed 
the case, although it is also the case that any row/column 
interchange equivalent to an optimal square 
partitioning is also optimal [8]. For the remainder of 
this paper, we will assume that square partitioning is 
used, that is, the result matrix C is partitioned into 
square submatrices to be computed by individual PEs. 
We must now determine the optimal size of the 
submatrices, and the manner in which a processor will 
compute its submatrix. 


3.1 Problem Formulation. The matrix 
multiplication is partitioned into $H \times H$ square 
submatrices of C. For simplicity assume M is divisible 
by H. The number of PEs required is $n = M^2/H^2$. 


The manner in which PEs compute their subtasks 
must now be chosen. We will use the outer product 
method, with a parameter R determining the size of the 
subtask. The partitioning strategy is: 


1. Each PE is assigned the task of computing a 
submatrix of C, e.g. $C_{I,J}$, $I, J \in \{0, 1, 2, \ldots, (M/H - 1)\}$. 

2. Tasks assigned to PEs are further divided into M/R 
subtasks ($1 \le R \le M$); each subtask computes a 
partial result of $C_{I,J}$ using the outer product method. 


This task decomposition can be expressed 
algorithmically as: 


FORALL I,J in (0 .. (M/H - 1))  /* M*M/(H*H) tasks */ 
  FOR i,j in (1 .. H) C(I*H+i,J*H+j) = 0; ENDFOR  /* init */ 
  FOR l in (0 .. (M/R - 1))  /* the subtasks */ 
    FOR k in (1 .. R) 
      FOR i,j in (1 .. H) 
        C(I*H+i,J*H+j) = C(I*H+i,J*H+j) 
                         + A(I*H+i,l*R+k) * B(l*R+k,J*H+j); 
      ENDFOR 
    ENDFOR 
  ENDFOR 
ENDALL 
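
The same decomposition can be written as a short sequential Python sketch, which is convenient for checking that the H x H tasks and M/R subtasks together cover the full product. The FORALL is executed sequentially here, indices are 0-based, and matrices are lists of rows; these are choices of this sketch rather than of the paper.

```python
def block_matmul(A, B, H, R):
    """Compute C = A x B using H x H tasks split into M/R outer-product subtasks."""
    M = len(A)
    assert M % H == 0 and M % R == 0, "M must be divisible by H and R"
    C = [[0] * M for _ in range(M)]
    for I in range(M // H):                 # FORALL I, J: one task per block of C
        for J in range(M // H):
            for l in range(M // R):         # the M/R subtasks of this task
                for k in range(l * R, (l + 1) * R):
                    for i in range(I * H, (I + 1) * H):
                        for j in range(J * H, (J + 1) * H):
                            # rank-1 (outer product) contribution of index k
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Each subtask reads an H x R strip of A and an R x H strip of B, which is the unit of communication used in the cost analysis that follows.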


The allocation of subtasks to the PEs is done in an 
overlapped fashion. The subtasks are grouped into sets 
with the l-th set containing the subtasks computing the 
l-th partial results for each PE. Hence there are a total 
of M/R sets and the number of subtasks in each set is 
n. The computation and communication are performed 
in rounds; each round computes one set of the subtasks. 
For our purposes, each subtask can be considered as 
one indivisible access and computation process, i.e. 
$H \times R$ elements of A and $R \times H$ elements of B are 
read in by a processor and the partial result is 
computed. 


3.2 Partition Regions. In Figure 2, we show several 
different partitions and allocations (for M = 16), where 
shaded and unshaded boxes indicate communication 
and computation respectively. Let $T_{tot}$ be the total 
time for performing the entire matrix multiplication. 
Let $t_{cp}$ and $t_{cm}$ be the computation time and 
communication time for an individual subtask. In 
Figure 2 (a) R = M; note that there is only one round 
in the process. In Figure 2 (b), the partition is such 
that the communication time dominates the total time. 
We can see that the bus will be busy most of the time. 
During the process, processors become idle while 
awaiting access to the bus. In Figure 2 (c), the 
partition is such that the computation time is 
dominant. In this case the processors are busy most of 
the time, but the bus becomes idle from time to time. 


Figure 2: Computation/communication overlap. 
Panels: (A) r=16, h=4; (B) r=4, h=4; (C) r=4, h= 


As illustrated by the examples, the system may 
operate in two different regions depending on the 
partitioning: computation bounded or communication 
bounded. 


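3.3 Overlapped Computation/Communication. 
For a subtask, we have: 


$t_{cp} = H^2 \times R \times T_x$    (3.1) 

$t_{cm} = 2 \times H \times R \times T_c$    (3.2) 


where $T_x$ and $T_c$ are as defined in Section 2. In the 
computation-bounded region the total time is 


$T_{tot}^{cp} = \frac{M}{R} (t_{cp} + t_{cm}) + (n-1) \times t_{cm}$    (3.3) 

$= H^2 \times M \times T_x + 2H \times T_c \times (M + R(M^2/H^2 - 1))$    (3.4) 


In the communication-bounded region the total time is 


$T_{tot}^{cm} = \frac{M}{R} \times n \times t_{cm} + t_{cp}$    (3.5) 

$= 2 \times \frac{M^3}{H} \times T_c + H^2 \times R \times T_x$    (3.6) 
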
We note that as R increases, both equations will result 
in longer finishing times. Therefore we should choose 
the minimum possible R to reduce the total time in 
either case. Letting R = 1, equations (3.4) 
and (3.6) become: 


$T_{tot}^{cp} = H^2 \times M \times T_x + 2H \times T_c \times (M + (M/H)^2 - 1)$    (3.7) 

$T_{tot}^{cm} = 2 \times \frac{M^3}{H} \times T_c + H^2 \times T_x$    (3.8) 


Now let us determine the value of H which will 
minimize the total time; this will be referred to as $H_{opt}$. 
Let us first determine the value of H at which the 
computation and communication times are totally 
overlapped. This happens precisely when the 
computation time of a subtask on one processor can be 
exactly overlapped with the communication time of the 
other (n-1) processors. Under this condition, for 
example, the i-th round is ready to start to use the bus 
for the first subtask at exactly the moment when the 
communication for the (i-1)-th round completes. In this 
case, we have: 


$(n-1)\, t_{cm} = t_{cp}$    (3.9) 


Substituting (3.1) and (3.2) into (3.9) and noting that 
$n = M^2/H^2$, we obtain: 


$H^3 \times T_x + 2 \times H^2 \times T_c - 2 \times M^2 \times T_c = 0$    (3.10) 


Equation (3.10) establishes the key condition for 
maximum overlap between computation and 
communication time for matrix multiplication. 


We have derived the condition of partitioning for 
maximum overlap from the argument of optimum tiling 
of the computation/communication timing graph. We 
have also deduced separately the formulas for computing 
the total time valid for the computation-bounded 
and the communication-bounded regions, (3.7) and 
(3.8). It is easy to show that the value of H at the 
intersection point of $T_{tot}^{cp}(H)$ and $T_{tot}^{cm}(H)$ satisfies 
equation (3.10). This can be shown by subtracting (3.8) 
from (3.7) and setting the result to zero, which yields 
equation (3.10). Thus the maximum overlap condition 
happens at precisely the intersection of the computation 
and communication bounded regions. 
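
The cost model of this section can be checked numerically with a small sketch. Here Tx stands for the time per multiply-accumulate and Tc for the average transfer time per element; the function names and the sample values are assumptions for illustration. With R = 1, expanding the difference of the two total times gives (M - 1)/H times the left-hand side of (3.10), which is why the two curves intersect exactly where (3.10) holds.

```python
def subtask_times(H, R, Tx, Tc):
    """Computation and communication time of one subtask, as in (3.1) and (3.2)."""
    return H * H * R * Tx, 2 * H * R * Tc


def total_time_cp(M, H, R, Tx, Tc):
    """Total time in the computation-bounded region, as in (3.3)."""
    t_cp, t_cm = subtask_times(H, R, Tx, Tc)
    n = (M / H) ** 2
    return (M / R) * (t_cp + t_cm) + (n - 1) * t_cm


def total_time_cm(M, H, R, Tx, Tc):
    """Total time in the communication-bounded region, as in (3.5)."""
    t_cp, t_cm = subtask_times(H, R, Tx, Tc)
    n = (M / H) ** 2
    return (M / R) * n * t_cm + t_cp


def overlap_residual(M, H, Tx, Tc):
    """Left-hand side of the maximum-overlap condition (3.10)."""
    return H**3 * Tx + 2 * H**2 * Tc - 2 * M**2 * Tc
```

For example, with M = 16, H = 4, Tx = 1 and Tc = 2 the residual is negative, so this partition lies on the communication-bounded side of the intersection.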


3.4 Optimal Partition Size. It seems intuitive 
that the value of H which satisfies equation (3.10) 
should lead to a minimum total time because the 
processors and the bus are kept usefully busy all the 
time (except for initial and terminal transient 
behavior). From (3.7), we can compute the derivative 
of T4412 the computation-bounded region and rewrite 


it in the following form: 


= ———x(H° XT + HT, 


1 
—~—xH*° xT, — MXT,) (3.11) 


According to (3.10), when H > Hy, We have 


t 


HP XT, +H’ XT, >2XM°XT, -H?XT, (3.12) 
Therefore, from (3.11) and (3.12) we deduce that when 
H> AL oot the following holds in computation-bounded 


region: 
tot 2M x(x MH? x HP 3.13 
eg Ie =H x —M) (3.13) 


Since M > H, from (3.13) we claim that when $H > H_{opt}$
the following is true for all M > 2:

$$\frac{dT_{tot}}{dH} > 0 \quad (H > H_{opt}) \qquad (3.14)$$

Similarly, from (3.8) we can show that in the
communication-bounded region

$$\frac{dT_{tot}}{dH} < 0 \quad (H < H_{opt}) \qquad (3.15)$$

From (3.14) and (3.15), we know that the total time in
the communication-bounded region is a decreasing
function of H until $H = H_{opt}$. After that, in the
computation-bounded region, the total time becomes an
increasing function of H. Therefore, $H_{opt}$ must be the
one and only value in the range 1 ... M at which $T_{tot}$
reaches its minimum. We have thus proved that
the maximum overlapping point is indeed the
optimal partitioning point.


Since equation (3.10) is cubic in H, and there is no
linear term in H present, it can be shown [8] that there
always exists a unique solution in the range 1 ... M
for $H_{opt}$ which satisfies equation (3.10). For large M,
the solution can be approximated by

$$H_{opt} \approx \left( 2 \times M^2 \times \frac{T_{cm}}{T_{cp}} \right)^{1/3} \qquad (3.16)$$


4. Simulation. In this section, we describe a
simulation study which was performed to validate the
assumptions made in the preceding analysis. A sample
of the simulation results is shown in the following
plot. The solid line indicates the results predicted using 
the analysis presented in the previous sections, while 
the simulation results are represented by point markers. 
In Figure 3, the differences in execution time for
different cache line sizes are significant for small values
of H (H < 8). For larger values of H, i.e. H > 8, the
difference is much less substantial. As was shown in
Section 4, the execution time becomes computation
bounded for large values of H, i.e. $H > H_{opt}$.
Therefore, the change of communication cost due to
different cache line sizes has little impact on the
execution time. On the other hand, for H less than
$H_{opt}$ the execution time is communication bounded;
therefore, any increase in the communication cost
contributes directly to execution time.


From Figure 3, it can be seen that the line size can
have a significant effect on execution time (a factor of
3) in terms of the optimum attainable performance.
From the data shown, smaller line sizes appear to
perform better. However, for very
small line sizes, where small depends on the packet
overhead, the additional overhead causes performance
degradation. Thus there is a window of appropriate
cache line sizes. In summary, the simulation results do
indeed validate the analysis.


5. Optimal Speedup. The optimal decomposition
can be used to derive a speedup function. Let us derive
the optimum speedup, denoted by $S_{opt}$, for our
partitioning scheme. Here the speedup means the total
uniprocessor time $T_u$ divided by the multiprocessor
processing time $T_{tot}$, i.e. $S = T_u / T_{tot}$. We can derive
the optimal speedup from either the computation-bounded or the
communication-bounded expression for the total time. For
example, substituting the computation-bounded total time, we arrive at

$$S_{opt} = \frac{T_u}{T_{tot}} \qquad (5.1)$$
Since l= nX Mx bop? Din =O XMXtO and 
i =(n—1)Xt,,, at optimal partition, we have 
t 
nx M cp 
ee 5.2 
“opt nx M+(n-1) t, 62) 
Assuming M >> 1 and considering the optimal
condition, we have
1 Ts 
SW edreeei ee XP (5.3) 
From (5.3) and (3.16), we get 
1 21/3 o 2/3 
S. 1 5Xx2xM’) x (F) (5.4) 


Comparing (5.4) with the corresponding result in [2], we
can observe that, in both approaches, the optimal
speedup is proportional to $(T_{cp}/T_{cm})^{2/3}$. However, our
result differs from the result in [2] by the factor
$(2M^2)^{1/3}/9$. This factor implies that the optimal
speedup increases with the size of the matrix.


6. Computation/Communication Tradeoffs.
The study reported here is interesting because the task 
in question, matrix multiplication, is regular enough 
that it can perform well on very powerful processing 
elements, but has sufficient parallelism to be exploited 
on very large numbers of processing elements. What 
the results of this study indicate is that for a fixed 
communications speed, one can use less powerful 
processing elements to achieve comparable speedup, 
provided one has sufficiently many of them. 
Figure 4: Effect of proc. speed on opt. performance


Figure 4 shows the time required for a large matrix 
multiply at the optimal partition size as the processor 
speed is varied. The two identified points represent $T_{cp}$
= 3 and $T_{cp}$ = 6, i.e. a halving of the speed of each
PE, while keeping the communication costs fixed.
The resulting optimal performance shows a decrease of 
about 26%, with an increase in the number of 
processing elements to achieve this performance of 
about 50%. Thus we pay a penalty for the loss of 
processor speed, but it is much less than a halving of 
performance. Hence in considering alternatives, one 
should keep in mind the nonlinear relationship between
individual element speed and system speed.
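
The scaling behind these figures can be checked with a minimal sketch. It assumes the large-M approximation (3.16) has the form $H_{opt} \approx (2M^2 T_{cm}/T_{cp})^{1/3}$, that the number of processing elements scales as $(M/H)^2$, and that the uniprocessor time is proportional to $T_{cp}$ while the optimal speedup is proportional to $(T_{cp}/T_{cm})^{2/3}$. The function names and the PE-count scaling are illustrative assumptions, not taken from the paper.

```python
def h_opt_approx(M, t_cm, t_cp):
    """Large-M approximation of the optimal partition size.

    Assumed form of (3.16): (2 * M^2 * t_cm / t_cp)^(1/3).
    """
    return (2.0 * M ** 2 * t_cm / t_cp) ** (1.0 / 3.0)


def slowdown_ratios(M, t_cm, t_cp_fast, t_cp_slow):
    """Return (PE-count ratio, execution-time ratio) when the PE slows down.

    Assumptions (hypothetical, for illustration only):
      * the number of PEs scales as (M / H_opt)^2;
      * time at the optimum is T_u / S_opt, with T_u proportional to t_cp
        and S_opt proportional to (t_cp / t_cm)^(2/3).
    """
    h_fast = h_opt_approx(M, t_cm, t_cp_fast)
    h_slow = h_opt_approx(M, t_cm, t_cp_slow)
    pe_ratio = (h_fast / h_slow) ** 2
    time_ratio = (t_cp_slow / t_cp_fast) ** (1.0 / 3.0)
    return pe_ratio, time_ratio


if __name__ == "__main__":
    pe_ratio, time_ratio = slowdown_ratios(M=1024, t_cm=1.0,
                                           t_cp_fast=3.0, t_cp_slow=6.0)
    print(f"PE count ratio: {pe_ratio:.3f}, time ratio: {time_ratio:.3f}")
```

Under these assumptions, halving the processor speed multiplies the execution time at the optimum by $2^{1/3} \approx 1.26$ and the PE count by $2^{2/3} \approx 1.59$, which is of the same order as the figures quoted above.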


7. Discussion. Some related results are reported in 
the recent work described by Vrsalovic et al. and
Cvetanovic [2, 7]. In contrast to the work reported in 
[2], however, we analyse the time for a single matrix 
multiplication, as opposed to the rate of execution of 
matrix multiplies (i.e. latency vs throughput). 


We are able to show analytically that, for the ideal
machine model, square partitioning is necessary for
optimality, and our analysis is applicable to both
square and non-square matrices [8]. Jalby and Meier
[3] in the Cedar Project have reported a study on
optimal partitioning of matrix multiplication for their
machine. Under a different partitioning scheme, their
results show that a rectangular submatrix may be the
best choice for a processor architecture with combined
vector and parallel capabilities. However, their work
does not provide a complete analysis of the combined
computation and communication time. Instead,
stating that such a combined minimization problem is
difficult, they treat each as an independent problem
by considering only the dominant terms.


We have studied the critical problem of
computation/communication tradeoffs for
multiprocessor systems. We have demonstrated that
there exists an optimal decomposition under which the
matrix multiplication problem can achieve a maximum
speedup on a shared memory multiprocessor with single
bus interconnection. The size for the optimal
partitioning is analytically characterized, and an
allocation is formulated which can achieve optimal
performance. Simulation results have confirmed our
analytic framework.
Abstract: In this paper, we present an analysis of a concurrency
control algorithm for replicated database systems.
We present a model of distributed database systems, which 
provides a framework to study the performance of different 
concurrency control algorithms and use the model in the 
analysis of a concurrency control algorithm. We show that 
even after making some assumptions, detailed performance 
model of a concurrency control algorithm is so complicated 
that it is impossible to find its closed-form solution. We 
circumvent this problem by making two assumptions: first, 
we assume that the state of a site is statistically indepen- 
dent of the state of other sites, which permits us to analyze 
a single site rather than the whole system. Second, we 
assume that an update sees the average state of the system 
and all the updates exhibit the average steady-state 
behavior, which permits us to work with averages rather 
than with probability distributions. Therefore, the technique 
used in the analysis is approximate and iterative. 


1. Introduction 

In this paper, we analyze the performance of an 
optimistic concurrency control algorithm for replicated 
database systems. In optimistic concurrency control algo- 
rithms, an update is executed concurrently with other 
updates without performing any synchronization. How- 
ever, before its computed values are written into the data- 
base, it enters a validation phase which determines if the 
update has conflicted with any concurrent update. In case
of conflicts, the lower-priority update is aborted; otherwise its
computed values are written into the database. Depending upon
the way intersite communication and validation phase are 
carried out, several optimistic algorithms have appeared in 
the literature. The optimistic algorithm which is analyzed 
in this paper functions in the following way: A site $S_i$
maintains three queues: SuspendQ_i, which contains local
updates which cannot be executed due to conflicts with
local updates awaiting the results of their validation;
LocalQ_i, which contains local updates that have been tentatively
executed and are awaiting the results of validation; and
RemoteQ_i, which contains remote updates that are awaiting
the results of their validation.

When a site $S_i$ receives a user request to execute an
update U, it performs the following sequence of actions: if
there is no entry in LocalQ_i that has a w-r conflict with U,
then it assigns U a timestamp TS(U), places an entry for U
into LocalQ_i, tentatively executes U, and sends out
validation(write_set(U), TS(U)) messages to all other sites.
When a site $S_j$, $i \neq j$, receives this message, it places
(write_set(U), TS(U)) into RemoteQ_j, updates its clock, and
returns a reply message to $S_i$ which contains the current
timestamp of $S_j$. After $S_i$ has received a message with
timestamp larger than TS(U) from all other sites (called
condition R1), it checks if there is an entry in RemoteQ_i
which has a w-r conflict with U and has a smaller timestamp
than TS(U). (Negation of this condition is denoted by R2.)

If condition R2 passes, then U is committed: site $S_i$
writes the computed values in its database copy and sends
these values in update-commit messages to all other sites.
When a site $S_j$ receives these values, it writes them into its
database copy, discards the corresponding entry from
RemoteQ_j, and aborts all local updates from LocalQ_j which
have an r-w conflict with U. If condition R2 fails, then U is
aborted, its entry is removed from LocalQ_i, and update-abort
messages are sent to all sites. On the receipt of this
message, a site removes the entry for U from its RemoteQ.
A site executes the write action of an update using the
Thomas Write Rule (TWR) [4], where a write to a data object is
ignored if it has already been written by an update of
higher timestamp.
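
The Thomas Write Rule described above can be illustrated with a minimal sketch. The `Replica` class and its field names are hypothetical and only model a single database copy; timestamps are assumed to be unique and totally ordered.

```python
class Replica:
    """One database copy that applies writes under the Thomas Write Rule.

    Hypothetical illustration: 'values' holds the current value of each data
    object and 'write_ts' the timestamp of the update that last wrote it.
    """

    def __init__(self):
        self.values = {}
        self.write_ts = {}

    def write(self, obj, value, ts):
        """Apply the write unless a higher-timestamp update already wrote obj.

        Returns True if the write was applied, False if it was ignored.
        """
        last_ts = self.write_ts.get(obj)
        if last_ts is not None and last_ts > ts:
            return False  # obsolete write: ignored, not an error
        self.values[obj] = value
        self.write_ts[obj] = ts
        return True


if __name__ == "__main__":
    replica = Replica()
    print(replica.write("x", "new", ts=5))   # applied
    print(replica.write("x", "old", ts=3))   # ignored, ts 5 already written
    print(replica.values["x"])
```

Because an out-of-order write with a lower timestamp is simply dropped, all copies converge to the value written by the highest-timestamp update regardless of the order in which commit messages arrive.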


2. Performance Model 


We model arrivals of updates by a Poisson process. 
This is justified because a database is usually shared by a 
large number of independent users. We assume that the 
size of the readset and the writeset of updates are identi- 
cally and independently distributed random variables with 
Geometric distribution. This assumption is justified because 
the updates that reference a small number of data objects 
occur more frequently than the updates that reference a 
large number of data objects. We assume that access to
data objects is uniformly distributed across the entire database,
i.e., every data object is accessed with equal likelihood.

A computer system, or site, provides the users with 
the facilities such as CPU for processing and main memory 
and secondary memory (usually disk) for storage capabil- 
ity. Execution of an update requires service at CPU for 
queue manipulation, update computations, message han- 
dling, etc. Secondary memory holds the data objects of a 
database, and in practice, it may consist of several disks.
For simplicity, we assume only one disk in the model. An 
update makes a disk access for reading data objects or for 
writing computed values into the database. Disk service 
time depends upon seek-time, rotational delay, and data- 
transfer delay. We assume that the disk service time is 
exponentially distributed and the disk serves requests in 
FCFS order. Usually, concurrency control algorithms 
require data structures such as queues, lock tables, graphs, 
etc. We assume that the main memory at each site is large 
enough to hold these data structures. 


The communication medium affects the performance by
introducing a finite delay in every intersite communication.
This delay is modeled by an infinite server, i.e., the
communication medium serves all intersite communication in
parallel. In practice, this usually holds in store-and-forward
networks. To simplify the model, we assume that the
service time of the communication medium (i.e., time taken
by a message to travel from one site to another site) is
constant.


Figure 1. Flow diagram of the life-cycle of an update


Parameters of the Model

(1) N: The number of sites in the system.

(2) $\lambda$: Update arrival rate, which is the rate of update
arrivals at a site. (Poisson distributed with parameter $\lambda$.)

(3) rs/ws: Update size, which is the average number of
data objects in the readset/writeset of updates.
(Geometrically distributed with mean rs/ws.)

(4) M: Size of database, which is the number of data
objects in each copy of the database.

(5) T: Message propagation time, which is the average
time taken by a message to propagate through the
communication medium.

(6) $1/\mu$: Disk access time, which is the average time to
read data objects in the readset or write data objects
in the writeset of an update. (Exponentially distributed
with parameter $\mu$.)

Time spent by CPU in comparing timestamps, manipulating
queues, computing updates, and processing messages
is usually much smaller than disk access time and
message propagation time. Therefore, we neglect CPU in
the performance analysis.

In the performance study, we will be interested in
computing update response time, which is the time interval
between the instants when an update is submitted by a user
to a site and when the update is completely executed at
that site.

3. Performance Analysis of the Algorithm 


The life-cycle of an update is depicted in Figure 1.
When an update arrives, it is diverted to SuspendQ if its 
host site already has a conflicting entry in its LocalQ. Oth- 
erwise, the update proceeds with its tentative execution 
where it performs its read and compute operations and per- 
forms all its writes on a temporary storage. Then the site 
performs validation for the update by exchanging mes- 
sages. If the update fails the validation, it aborts and res- 
tarts, else it commits its writes to the database at all the 
sites using TWR. 


3.1. Difficulties in Analyzing the System 
Computation of update response time requires computation
of the probability of going to SuspendQ (denoted by
$P_{sus}$), average wait time in SuspendQ, average wait in
LocalQ, the probability of passing validation (denoted by
$P_{pass}$), and time taken to commit the writes. Computation
of $P_{sus}$ and $P_{pass}$ for an update requires detailed knowledge
of the system state when these decisions/checks are made.
For example, to determine $P_{sus}$ for an update, we should
know the exact data requirements for each update in
LocalQ; likewise, we should know the exact data requirements
of all global updates to determine $P_{pass}$. We can
model such a system by a Markov chain by making
appropriate assumptions about the probability distribution
of service rates. However, the Markov chain will have
such a large number of states and such a complex structure
that it will not be feasible to analyze it even for a database
system of small size.


3.2. Approach Taken 

We handle the state space explosion by making two 
approximations. First, we assume that the state of a site is 
statistically independent of the states of other sites. As a 
result, we can analyze each site in isolation and the effect
of other sites on a site can be reflected by write/update
activity due to other sites. Since we assume that the system 
is homogeneous, on the average all sites will perform 
identically and performance measures can be obtained by 
analyzing only one site. (Note that it is easier to analyze 
one site rather than the entire system.) Such decomposition 
technique has already been applied in the analysis of load 
balancing algorithms [1], concurrency control algorithms 
[3], multiple access protocols and store-and-forward packet 
switching networks [2]. 

After making appropriate assumptions about the pro- 
bability distribution of service rates, we can model a single 
site by a Markov chain whose states are detailed enough to 
capture concurrency control activities at that site, e.g., 
number of running/blocked transactions, data objects 
held/acquired by them, etc. It is not difficult to see that
even for a database system of small size, the Markov chain for
a site will have such a huge state space and such a complex
structure that obtaining its closed-form solution will be
practically impossible. Here we make the second approximation:
rather than working with the probability distributions,
we work with the averages; that is, we assume that an
update sees the average state of the system and all updates
exhibit average behavior.

We exploit the interdependence among the variables 
to derive a set of equations and solve it using an iterative 
technique. Initially, we assume that the probability of an 
update restart is zero, or very small, and compute waiting 


time in different stages of an update. From these waiting 
times, we estimate new probability of an update restart 
which in turn gives new waiting times. This process is 
repeated until the difference in the value of a variable of 
interest (e.g., the probability of restart or wait time) 
between two successive iterations is less than a desired 
value. 

The probability that two independently selected
groups of data objects of size a and b out of the M data
objects have a conflict (denoted by $\Phi(a, b)$) is:

$$\Phi(a, b) = 1 - \mathrm{Prob}(\text{there is no data object common between the groups})$$

$$= 1 - \binom{M-a}{b} \Big/ \binom{M}{b} \approx \frac{a \cdot b}{M}$$
This result is used heavily in the analysis. 
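
As a quick numerical check of this approximation, the sketch below compares the exact probability that two uniformly chosen groups of sizes a and b overlap with the first-order estimate a*b/M. The function names are illustrative, and the exact form (a ratio of binomial coefficients) is the standard counting argument, assumed here to be what the paper intends.

```python
from math import comb


def conflict_exact(a, b, M):
    """Exact probability that two random groups of sizes a and b,
    drawn from M data objects, share at least one object."""
    if a + b > M:
        return 1.0  # the groups cannot be disjoint
    return 1.0 - comb(M - a, b) / comb(M, b)


def conflict_approx(a, b, M):
    """First-order approximation a*b/M, capped at 1."""
    return min(1.0, a * b / M)


if __name__ == "__main__":
    for a, b, M in [(5, 5, 10_000), (10, 20, 10_000), (5, 5, 100)]:
        print(a, b, M, conflict_exact(a, b, M), conflict_approx(a, b, M))
```

The two values agree closely when a*b is much smaller than M and drift apart as the groups become a sizeable fraction of the database, which is the regime where the linear approximation should be used with care.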


3.3. Performance Analysis 

An update may undergo execute-restart cycle several 
times before it actually commits. If the average waits in
SuspendQ and LocalQ are $W_{sus}$ and $W_{loc}$, respectively,
then the average duration of an execute-restart cycle, R, is:

$$R = P_{sus} \cdot W_{sus} + W_{loc} \qquad \ldots(1)$$


We assume that the probability of an update restart is 
independent of the probabilities of its earlier restarts. 
Therefore, the number of restarts an update undergoes 
before it commits is Geometrically distributed. If the probability
of an update restart is $P_{res}$ ($= 1 - P_{pass}$) and the
average time to commit writes of an update is $W_{write}$, then
the update response time is,

$$Resp = \left[ (1-P_{res}) \cdot R + P_{res}(1-P_{res}) \cdot 2R + \ldots \right] + W_{write}$$

$$Resp = \frac{R}{1-P_{res}} + W_{write} \qquad \ldots(2)$$
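
The closed form for the response time follows from the mean of a geometric distribution. As a worked step, assuming (as above) that each execute-restart cycle independently ends in a restart with probability $P_{res}$, let K be the number of cycles an update goes through:

```latex
\begin{align*}
\Pr(K = k) &= P_{res}^{\,k-1}\,(1 - P_{res}), \qquad k = 1, 2, \ldots \\
E[K] &= \sum_{k=1}^{\infty} k\, P_{res}^{\,k-1}\,(1 - P_{res})
      = \frac{1}{1 - P_{res}} \\
Resp &= R \cdot E[K] + W_{write}
      = \frac{R}{1 - P_{res}} + W_{write}
\end{align*}
```

For example, with $P_{res} = 0.2$ an update goes through 1.25 cycles on the average, so the cycle time R contributes 1.25 R to the response time.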

If on the average there are $N_{loc}$ entries in LocalQ,
then an arriving update will go to SuspendQ if it intends to
read any data object which belongs to the writeset of any
of the $N_{loc}$ updates; that is, there is a data object common
between rs and $N_{loc} \cdot ws$ data objects. The probability of
this happening is,

$$P_{sus} = \Phi(rs, N_{loc} \cdot ws) = \frac{N_{loc} \cdot ws \cdot rs}{M} \qquad \ldots(3)$$

The effective rate of update arrival at a site is $\lambda/(1-P_{res})$
because aborted updates are retried. Therefore, the
average number of entries in LocalQ is (Little's law),

$$N_{loc} = \frac{\lambda}{1-P_{res}} \cdot W_{loc} \qquad \ldots(4)$$




Computation of $W_{loc}$


An entry stays in LocalQ until it is aborted by a
committed entry in RemoteQ (the committing entry has a w-r
conflict with the aborted update and has priority over it) or
until condition R1 holds for it (at that instant, it may commit
or abort depending upon whether it passes condition
R2 or not). If there are $N_{rem}$ entries in RemoteQ on the
average, then $N_{rem} \cdot (1-P_{res})$ of them will commit on the
average and can potentially abort entries in LocalQ. Since
a committing entry in RemoteQ aborts all the entries in
LocalQ which have an r-w conflict with it, the probability that
an entry in LocalQ is aborted before condition R1 holds
for it is,

$$P_1 = \Phi(rs, N_{rem} \cdot (1-P_{res}) \cdot ws) = \frac{N_{rem} \cdot (1-P_{res}) \cdot ws \cdot rs}{M} \qquad \ldots(5)$$


For an entry in LocalQ, condition R1 holds for it after
a delay of 2T after it has been placed in LocalQ. An entry
can get aborted at any instant with equal probability before
R1 holds for it (an assumption). Therefore, on the average
an entry stays in LocalQ for duration T given that it gets
aborted before R1 holds for it. If an entry is not aborted
before R1 holds for it (this happens with probability
$1-P_1$), it may get aborted after R1 holds for it provided there
is a smaller timestamp entry in RemoteQ which has a w-r
conflict with it (i.e., R2 does not hold). Since entries arrive
in RemoteQ at a rate $(N-1)\lambda/(1-P_{res})$, on the average
$2T(N-1)\lambda/(1-P_{res})$ entries in RemoteQ can cause an update
to abort due to R2 (only if they commit and have a w-r
conflict with this update). The probability of this happening is:

$P_2$ = Prob(R2 does not hold for U | U is not aborted before
R1 holds for it)

$$P_2 = \Phi\left(rs, \frac{2T(N-1)\lambda \cdot ws}{1-P_{res}}\right) = \frac{2T(N-1)\lambda \cdot ws \cdot rs}{(1-P_{res})\,M}$$
The update is committed if condition R2 also holds
for it when condition R1 holds for it. The probability of
this happening is:

$$P_{pass} = 1 - P_{res}$$

$$= \mathrm{Prob}(\text{R2 holds for } U \mid U \text{ is not aborted before R1 holds for it}) \cdot \mathrm{Prob}(U \text{ is not aborted before R1 holds for it})$$

$$= (1-P_2)(1-P_1) = \left[ 1 - \frac{2T(N-1)\lambda \cdot ws \cdot rs}{(1-P_{res})\,M} \right] (1-P_1) \qquad \ldots(6)$$


Figure 2. (Timing diagram: $S_i$ sends a validation message to $S_j$; $S_j$ places the entry into RemoteQ_j; $S_i$ sends a commit or abort message to $S_j$; $S_j$ removes the entry from RemoteQ_j.)
Irrespective of the outcome of the R2 check, the update
on the average waits for 2T time units. Also, an entry
waits in LocalQ until it is aborted before R1 holds for it or
after R1 holds for it. Therefore,

$W_{loc}$ = E[wait in LocalQ | update is aborted before R1
holds] * Prob(update is aborted before R1 holds) +
E[wait in LocalQ | update is not aborted before R1 holds] *
Prob(update is not aborted before R1 holds)

$$W_{loc} = T \cdot P_1 + 2T \cdot (1-P_1) \qquad \ldots(7)$$


Computation of $W_{rem}$


If on the average an entry waits in RemoteQ for $W_{rem}$
time units, then from Little's law (note that entries arrive
in RemoteQ at a rate $(N-1)\lambda/(1-P_{res})$):

$$N_{rem} = W_{rem} \cdot \frac{(N-1)\lambda}{1-P_{res}} \qquad \ldots(8)$$

Next, we compute the average wait in RemoteQ,
$W_{rem}$. To determine $W_{rem}$, we first establish a relationship
between the time intervals for which the entries for an update stay in
LocalQ of its host site and RemoteQ of a remote site. Consider
the scenario shown in Figure 2. At instant t1, site $S_i$
sends out a validation message for an update to site $S_j$,
which reaches $S_j$ at instant t1+T, and site $S_j$ then places an
entry for the update in RemoteQ_j.


At instant t2, $S_i$ commits or aborts the update and
sends the corresponding message to $S_j$. $S_j$ receives the
message at instant t2+T and removes the corresponding entry
from RemoteQ_j. (Interval x, for which the entry stays in
LocalQ_i, depends upon whether the update was aborted
before condition R1 held for it or not.) Note that the duration
for which the corresponding entry stays in RemoteQ_j,
(t2+T) - (t1+T) = x, is the same as the duration for which
the corresponding entry stays in LocalQ_i. Therefore, for an
update, its entries stay in LocalQ and RemoteQs for the
same amount of time and

$$W_{rem} = W_{loc} = P_1 T + 2T(1-P_1) \qquad \ldots(9)$$


Computation of $W_{sus}$

An entry in SuspendQ is checked for the possibility of
getting unblocked whenever an entry departs
(commits/aborts) from LocalQ. The average interdeparture
time of updates at a site, IT, is $(1-P_{res})/\lambda$ because the update
departure rate is the same as the effective arrival rate,
$\lambda/(1-P_{res})$, in equilibrium. Since a departing update
releases ws data objects on the average, the probability that
a data object previously unavailable to an update in
SuspendQ now becomes available is $ws/(N_{loc} \cdot ws) = 1/N_{loc}$.
If the average conflict size * is cs, then the probability
that an update in SuspendQ gets unblocked when an
update departs from LocalQ is $p = (1/N_{loc})^{cs}$ (which is the
probability that all the conflict-causing data objects are
freed by the departing update). If we assume that the
probability of an update in SuspendQ getting unblocked is
independent for different departures of entries in LocalQ,
then the number of departures before an update in
SuspendQ gets unblocked follows a Geometric distribution
with parameter p. As a result, the average wait in
SuspendQ is,

$$W_{sus} = \frac{(N_{loc})^{cs} \cdot (1-P_{res})}{\lambda} \qquad \ldots(10)$$


* Conflict size is the number of conflict creating data objects;
i.e., the data objects which are common in two conflicting updates.
Computation of $W_{write}$


Data objects at a site are stored on a secondary
storage device, say a disk. Therefore, an access to the disk
is made when read and write actions of updates are executed.
Note that the read and the write actions arrive at the
disk at rates $\lambda/(1-P_{res})$ and $N\lambda$, respectively. If we assume
that the read and the write requests arrive at the disk
according to a Poisson distribution with parameter
$\Lambda = \lambda/(1-P_{res}) + N\lambda$ and the disk service time is exponentially
distributed with parameter $\mu$ for all these requests, then the
average response time at the disk is (M/M/1 server),

$$T_{disk} = \frac{1}{\mu - \Lambda} = W_{write} \qquad \ldots(11)$$


We obtain the performance of the algorithm by solving
the system of equations in the following manner: we
start with small values for waiting time in LocalQ and
RemoteQ and compute the probabilities of update restarts,
$P_{res}$ and $P_1$, using equations (3), (4), (5), (6), (7), and (8).
We use these values in equations (7) and (9) to compute
new values for waiting times, $W_{loc}$ and $W_{rem}$. This process
is repeated until the difference of waiting times between
two successive iterations is less than a small quantity, say 5
percent. Finally, the average update response time is
computed using equations (1), (2), (3), (10), and (11).
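
The iterative procedure above can be sketched as a simple fixed-point loop. This is a minimal illustration, not the authors' program: it assumes the equations have the forms $P_1 = N_{rem}(1-P_{res}) \cdot ws \cdot rs/M$, $P_2 = 2T(N-1)\lambda \cdot ws \cdot rs/((1-P_{res})M)$, $P_{res} = 1-(1-P_1)(1-P_2)$, $W_{loc} = W_{rem} = T P_1 + 2T(1-P_1)$, and so on, as read from the garbled source. The function name, the clamping of probabilities, and the parameter values in the usage example are all assumptions.

```python
def solve_model(N, lam, rs, ws, M, T, mu, cs=1.0, tol=0.05, max_iter=100):
    """Approximate iterative solution of the averaged model (a sketch).

    Equation numbers in comments refer to the assumed reconstruction of
    equations (1)-(11); probabilities are clamped to keep the sketch stable.
    """
    w_loc = w_rem = 1e-9  # start with small waiting times
    p_res = 0.0           # start with no restarts
    for _ in range(max_iter):
        n_rem = w_rem * (N - 1) * lam / (1.0 - p_res)              # (8)
        p1 = min(1.0, n_rem * (1.0 - p_res) * ws * rs / M)         # (5)
        p2 = min(1.0, 2 * T * (N - 1) * lam * ws * rs
                 / ((1.0 - p_res) * M))
        p_res = min(0.999, 1.0 - (1.0 - p1) * (1.0 - p2))          # (6)
        w_new = T * p1 + 2 * T * (1.0 - p1)                        # (7), (9)
        converged = abs(w_new - w_loc) <= tol * w_new
        w_loc = w_rem = w_new
        if converged:
            break

    n_loc = lam / (1.0 - p_res) * w_loc                            # (4)
    p_sus = min(1.0, n_loc * ws * rs / M)                          # (3)
    p_unblock = min(1.0, (1.0 / n_loc) ** cs)                      # clamped
    w_sus = ((1.0 - p_res) / lam) / p_unblock                      # (10)
    disk_rate = lam / (1.0 - p_res) + N * lam
    if disk_rate >= mu:
        raise ValueError("disk is saturated: arrival rate >= service rate")
    w_write = 1.0 / (mu - disk_rate)                               # (11)
    r = p_sus * w_sus + w_loc                                      # (1)
    resp = r / (1.0 - p_res) + w_write                             # (2)
    return {"p_res": p_res, "w_loc": w_loc, "p_sus": p_sus,
            "w_sus": w_sus, "w_write": w_write, "resp": resp}


if __name__ == "__main__":
    result = solve_model(N=5, lam=1.0, rs=5, ws=5, M=10_000, T=0.1, mu=50.0)
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

For lightly loaded parameter settings such as the one in the usage example, the loop settles after a couple of iterations because the restart probability stays small and the wait in LocalQ is dominated by the 2T round trip.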


4. Concluding Remarks 

We have presented a performance model of distri- 
buted database systems and used it to analyze the perfor- 
mance of an optimistic concurrency control algorithm in 
replicated database systems. We have shown that even after 
making some simplifying assumptions, detailed perfor- 
mance model of a concurrency control algorithm is so 
complicated that it is impossible to find its closed-form 
solution. We have solved this problem by assuming (i) that 
the state of a site is statistically independent of the state of 
other sites, which permits us to analyze a single site rather 
than analyzing the whole system and (ii) that an update 
sees the average state of the system and all the updates 
exhibit the average steady-state behavior, which permits us 
to work with averages rather than with probability distribu- 
tions. The performance analysis not only provides a sim- 
ple and quick method to compute the performance meas- 
ures of an optimistic concurrency control algorithm but 
also it gives an approximate method to analyze other dis- 
tributed concurrency control algorithms. 
Abstract


We give parallel algorithms for solving some string comparison problems
on the hypercube. For strings x and y with length(x) = m,
length(y) = n, and assuming n > m, we show the following: the
string edit problem, the longest common subsequence problem and the
minimum-length time-warping problem can be solved in $O(\log^2 n)$ time
using O(1) space per processing element (PE) on an SIMD hypercube
of $O(n^2/\log^2 n)$ PE's. We also show that the longest common substring
problem can be solved in $O(\log n)$ time using O(1) space per
PE on an SIMD hypercube of $O(n^2)$ PE's. Finally we show that the
substring problem (where typically m is much smaller than n) can be
solved in $O(m + \log n)$ time using O(m) space per PE on an MIMD
hypercube of O(n/m) PE's. We note that this algorithm has an optimal
processor-time product if $m = \Omega(\log n)$. The results of implementing
this algorithm on the NCUBE/7 hypercube machine are also presented.


1 Introduction 


String comparison techniques are important in many diverse fields such 
as text processing, image and signal processing, pattern recognition 
and artificial intelligence [20]. The string edit problem is to compute 
the minimum cost of transforming one string into another string using 
the edit operations insert, delete and replace [22]. Each edit operation 
has an associated cost which is a function of the alphabet from which 
the strings were composed. The cumulative cost is a measure of the 
dissimilarity or distance between two strings and is called the edit dis- 
tance or the weighted Levenshtein distance [18]. The longest common 
subsequence problem (LCS) is to determine the length of the longest 
subsequence (not necessarily a contiguous substring) common to both 
strings [22]. The minimum-length time-warping problem has particular 
importance in the domain of speech processing and is concerned with 
matching strings based on compression and expansion [20]. Many other 
sequence comparison problems may be found in [20]. 

The three problems just mentioned may be broadly characterized 
as inexact string matching problems. A slightly less difficult class of 
problems concerns finding only exact matches between strings or substrings.
The substring problem asks whether a shorter string (called the
pattern) is contained in a longer string (called the text) [16]. The longest
common substring problem (LCG) is to find the longest substring that 
is common to both strings [20]. 


2 The String Edit Problem 


The string edit problem is formally defined as follows [22]: let
$x = x_1 x_2 \ldots x_m$ and $y = y_1 y_2 \ldots y_n$ be strings over a fixed finite alphabet.
We are given the edit operations insert, delete and replace in order to
transform x into y. Associated with these operations are costs $I_a$, $D_b$,
and $R_{a,b}$, respectively, for all symbols a and b in the alphabet. The
minimum cumulative cost (called the edit distance) can be found using
dynamic programming: let M be an $(m+1) \times (n+1)$ table such that
$M(i,j)$ is the minimum cost of transforming $x_1 \ldots x_i$ into $y_1 \ldots y_j$. The
table entries can be computed using the recurrence:

$$M(0,0) = 0;$$

$$M(i,0) = \sum_{r=1}^{i} D_{x_r};$$

$$M(0,j) = \sum_{r=1}^{j} I_{y_r};$$

$$M(i,j) = \min \begin{cases} M(i-1,j-1) + R_{x_i,y_j}, \\ M(i-1,j) + D_{x_i}, \\ M(i,j-1) + I_{y_j}. \end{cases}$$


* This research was supported in part by NSF Grants DCR-8420935,
DCR-8604603 and ECS-8505662.

The longest common subsequence problem is a special case of the string
edit problem using the following cost function [22]:

$$I_a = D_a = 1,$$

$$R_{a,b} = \begin{cases} 0 & \text{if } a = b, \\ 2 & \text{otherwise} \end{cases}$$

for all symbols a and b in the alphabet. The minimum-length time-warping
problem is solved by restricting the form of the recurrence as
well as the cost function (also, the table is $m \times n$) [20]:


M(0,0) = 0; 
M(i—1,j-1)+7Re,,y;; 
M(i,j —1) +1/27Re,,y,- 


M (i,j) 


Now suppose that n > m. It has been observed that virtually all 
sequence comparison problems are variants of the string edit problem 
and hence can be solved by the same dynamic programming technique 
[18]. This approach takes O(mn) time sequentially [22]. An asymp- 
totically faster algorithm using a divide-and-conquer approach taking 
$O(mn/\min\{m, \log n\})$ time was given in [19]. (All logarithms mentioned
in this paper are base two.) In [18] it was shown that the edit distance
can be computed in parallel on a two-dimensional systolic array in O(n) 
time using mn processing elements (PE’s). An algorithm yielding the 
actual edit sequence in O(n) time using a one-way two-dimensional it- 
erative array of mn PE’s was given in [5]. The number of PE’s can be 
reduced to n using a one-way one-dimensional systolic array with the 
same time bound of O(n) [6, 12]. A similar algorithm with these same 
bounds but using an SIMD parallel machine that can simulate a linear 
array was given in [8]. On a bus automaton the LCS can be computed 
in constant time [3]. 
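
The sequential dynamic programming technique referred to above can be sketched as follows. This is a plain O(mn) illustration of the recurrence, not the parallel algorithm of this paper; the function names and the callable cost parameters are illustrative assumptions. The LCS helper uses the cost function with insert and delete cost 1 and replace cost 2, together with the standard identity LCS = (m + n - d)/2 for that cost function, which is not stated in the text above.

```python
def edit_distance(x, y, ins_cost, del_cost, rep_cost):
    """Minimum cost of transforming x into y by dynamic programming.

    ins_cost(b) and del_cost(a) give the cost of inserting or deleting a
    symbol; rep_cost(a, b) gives the cost of replacing a by b.
    """
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # first column: delete x_1 ... x_i
        table[i][0] = table[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, n + 1):  # first row: insert y_1 ... y_j
        table[0][j] = table[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = min(
                table[i - 1][j - 1] + rep_cost(x[i - 1], y[j - 1]),
                table[i - 1][j] + del_cost(x[i - 1]),
                table[i][j - 1] + ins_cost(y[j - 1]),
            )
    return table[m][n]


def lcs_length(x, y):
    """Length of the longest common subsequence via the edit distance."""
    d = edit_distance(
        x, y,
        ins_cost=lambda b: 1,
        del_cost=lambda a: 1,
        rep_cost=lambda a, b: 0 if a == b else 2,
    )
    return (len(x) + len(y) - d) // 2


if __name__ == "__main__":
    unit = lambda a, b: 0 if a == b else 1
    print(edit_distance("kitten", "sitting", lambda b: 1, lambda a: 1, unit))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Passing the costs as functions keeps the sketch close to the symbol-dependent costs of the recurrence, so the same routine covers both the weighted edit distance and the LCS special case.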

Our algorithm is essentially a parallelization of the dynamic pro- 
gramming technique, as are the parallel algorithms just noted. We will 
first indicate how to reformulate the recurrence relation as a weighted 
directed graph. Let us define a family {G,}, n > 1, of weighted directed 
graphs as follows: G, = (V, £) where 


\[ V = \{(i,j) \mid 0 \le i,j \le n-1\}, \]
\[ E = \{(i,j) \to (k,l) \mid k = i+1 \text{ or } l = j+1 \text{ or both, for } 0 \le i,j,k,l \le n-1\}. \]


Associated with each edge is a weight $c_{(i,j)\to(k,l)}$. The graph $G_4$ is
illustrated in Figure 1.


Figure 1: The graph G4. 


Now focus on the string edit problem since the other two problems
are subproblems and hence solvable using this same technique. Edge
weights are assigned to the graph $G_{n+1}$ as follows:


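\[
c_{(i,j)\to(k,l)} = \begin{cases}
I_{y_l} & \text{for } i = k,\ l = j+1, \\
D_{x_k} & \text{for } k = i+1,\ j = l, \\
R_{x_k,y_l} & \text{for } k = i+1,\ l = j+1
\end{cases}
\]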


for $0 \le i,k \le m$, $0 \le j,l \le n$. If $m \ne n$ then the remaining edges
are weighted with $+\infty$. It is not difficult to see that the edit distance
between $x$ and $y$ is equal to the length of the shortest path from vertex
$(0,0)$ to vertex $(m,n)$. Furthermore, the sequence of edges comprising
this shortest path corresponds to the optimal sequence of edit operations
used.
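The correspondence between the edit distance and a shortest path can be checked on small inputs with a sketch like the one below. It relaxes the three kinds of edges of the grid graph in row-major order, which is a topological order for this graph. The function name `edit_graph_shortest_path` and its cost parameters are hypothetical, and the code is sequential; it is not the parallel algorithm of this paper.

```python
import math


def edit_graph_shortest_path(x, y, ins, dele, rep):
    """Length of the shortest path from (0,0) to (m,n) in the edit graph."""
    m, n = len(x), len(y)
    dist = {(i, j): math.inf for i in range(m + 1) for j in range(n + 1)}
    dist[(0, 0)] = 0
    for i in range(m + 1):
        for j in range(n + 1):           # row-major order is topological
            d = dist[(i, j)]
            if j < n:                    # edge (i,j) -> (i,j+1): insert y_{j+1}
                dist[(i, j + 1)] = min(dist[(i, j + 1)], d + ins(y[j]))
            if i < m:                    # edge (i,j) -> (i+1,j): delete x_{i+1}
                dist[(i + 1, j)] = min(dist[(i + 1, j)], d + dele(x[i]))
            if i < m and j < n:          # diagonal edge: replace x_{i+1} by y_{j+1}
                dist[(i + 1, j + 1)] = min(dist[(i + 1, j + 1)],
                                           d + rep(x[i], y[j]))
    return dist[(m, n)]
```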

The algorithm we give uses divide-and-conquer to decompose the 
problem. Even though only one path is ultimately desired, it will be 
necessary to compute many paths in parallel at early stages of the 
algorithm. The perimeter-pairs shortest path problem for this family
of graphs is to determine the lengths of the shortest (directed) paths
between all pairs of vertices lying on the perimeter of $G_n$.


2.1 Overall Scheme 


Initially the $n^2$ vertices of a given graph $G_n$ are partitioned into $n^2$
subgraphs, called blocks, one vertex per block. These blocks are then
merged pairwise to form a collection of $n^2/2$ blocks, each containing two
vertices and the edge that connects them. We then solve the perimeter-pairs
problem for each of these blocks taken individually, which is trivial
for these blocks. The resulting blocks are then merged pairwise again
and the perimeter-pairs problem is solved for these enlarged blocks.
This procedure is repeated until one block containing all of the graph
remains. At this point, we have solved the problem for $G_n$. Notice that
at a given level of the merging, the perimeter-pairs problem is being
solved for disjoint subgraphs. Hence it can be performed on every block
in parallel. Also, it is not difficult to solve the perimeter-pairs problem
for a newly merged block if we previously have solved the problem for
the two component blocks.

The initial partitioning and subsequent merging must preserve the
topology of the graph. In other words, we must merge subgraphs that
are adjacent in the graph. For this family of graphs there is a natural
and symmetric merging procedure that obeys this constraint. For
$k = 2^i$, $0 \le i \le \log n - 1$, we alternately merge $k \times k$ blocks
pairwise horizontally and then merge the resulting $k \times 2k$ blocks pairwise
vertically. This is illustrated in Figure 2 for the graph $G_4$.
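The alternating merge order can be listed explicitly. The sketch below returns the block dimensions (rows, columns) produced by each stage; the helper name `merge_schedule` is hypothetical and $n$ is assumed to be a power of two.

```python
def merge_schedule(n):
    """Return the (rows, cols) of the blocks produced by each merging stage."""
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of two"
    stages = []
    k = 1
    while k < n:
        stages.append((k, 2 * k))        # horizontal merge of two k x k blocks
        stages.append((2 * k, 2 * k))    # vertical merge of two k x 2k blocks
        k *= 2
    return stages
```

For $n = 4$ this gives four stages, matching the four panels of Figure 2; in general there are $2 \log n$ stages.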


Figure 2: (a) First, (b) second, (c) third, and (d) fourth stages of 
merging for the graph G4. 


2.2 A Graph-theoretical Discussion 


Consider the graph $G_n$, where $n$ is a power of 2. Now focus on a given
stage where we are merging two $k \times k$ blocks horizontally. Denote
the left-hand block as the A-block and the right-hand block as the
B-block. Also denote the resulting adjoined block as the AB-block.
The perimeter-pairs problem has previously been solved for these $k \times k$
blocks at earlier stages of the procedure. Let $P_A$ be the set of vertices
lying on the perimeter of the A-block and let $P_B$ be similarly defined.
$P_{AB}$ is the set of vertices that lie on the perimeter of the adjoined
AB-block. Denote by $lsp(u, v)$ the length of the shortest path between
vertex $u$ and vertex $v$, or $+\infty$ if there is no path from $u$ to $v$. We
can define the lengths of the perimeter-pairs paths for the A-block as
a set of triples $A = \{(u, v, lsp(u, v)) \mid u, v \in P_A\}$. $B$ ($AB$) is similarly
defined for the B-block (AB-block). Let $D$ be the edges going directly
from the A-block to the B-block.

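In order to compute $AB$ we need only $A$, $B$ and $D$. Initially, we will
indicate how to compute $\widehat{AB} = \{(u, v, lsp(u,v)) \mid u, v \in (P_A \cup P_B)\}$,
from which $AB$ is easily obtained. $\widehat{AB}$ can be written as the union
of four disjoint sets: $\widehat{AB} = A \cup B \cup (A \leadsto B) \cup (B \leadsto A)$, where

\[ A \leadsto B = \{(u, v, lsp(u,v)) \mid u \in P_A,\ v \in P_B\}, \]

\[ B \leadsto A = \{(u, v, lsp(u,v)) \mid u \in P_B,\ v \in P_A\}. \]

Since no edge leads from the B-block back to the A-block, we can
immediately simplify $B \leadsto A = \{(u, v, +\infty) \mid u \in P_B,\ v \in P_A\}$.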

Consider splitting the edges of $D$ and inserting a unique "imaginary"
vertex within each edge as follows: for each $(r \to s) \in D$ define
two new weighted edges $(r \to w_s^r)$ and $(w_s^r \to s)$ where $w_s^r$ is a new
vertex that now lies directly between $r$ and $s$. Let $I = \{w_s^r \mid (r \to s) \in D\}$
be this set of imaginary vertices. Furthermore, define the weights of
these new edges as $c_{r \to w_s^r} = c_{r \to s}$ and $c_{w_s^r \to s} = 0$. This is illustrated in
Figure 3.




Figure 3: The set of imaginary vertices lies between the A-block and 
the B-block. 


The motivation for this construction is to write an expression for
$A \leadsto B$ in terms of the shortest paths to and from the imaginary
vertices:

\[
A \leadsto B = \Big\{ \big(u, v, \min_{w \in I} \{ lsp(u,w) + lsp(w,v) \}\big) \;\Big|\; u \in P_A,\ v \in P_B \Big\}.
\]


$A \leadsto B$ can be thought of as follows: the shortest path from $u \in P_A$ to
$v \in P_B$ must include exactly one edge of $D$. Due to the construction of
the imaginary vertices and new edges given above, we can equivalently
state that the shortest path from $u \in P_A$ to $v \in P_B$ must pass through
exactly one vertex $w \in I$. Also, because the path from $u$ to $v$ is the
shortest path, the paths from $u$ to $w$ and from $w$ to $v$ must also be
shortest paths. Therefore, having computed these component paths
$A \leadsto I$ and $I \leadsto B$, they can be merged to yield the desired $A \leadsto B$.
The case where two $k \times 2k$ blocks are merged vertically is handled in a
similar manner.
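The merge of the component paths through the imaginary vertices amounts to a (min, +) matrix product. A minimal sequential sketch follows, assuming the path lengths from $P_A$ to $I$ and from $I$ to $P_B$ are given as nested lists with `math.inf` for missing paths; the name `min_plus` and this matrix layout are illustrative assumptions.

```python
import math


def min_plus(P, Q):
    """(min, +) product: R[u][v] = min over w of P[u][w] + Q[w][v].

    P holds path lengths from P_A to the imaginary vertices I,
    Q holds path lengths from I to P_B.
    """
    inner = len(Q)
    assert all(len(row) == inner for row in P), "dimension mismatch"
    cols = len(Q[0]) if Q else 0
    return [
        [min((P[u][w] + Q[w][v] for w in range(inner)), default=math.inf)
         for v in range(cols)]
        for u in range(len(P))
    ]
```

For example, with `P = [[1, 4], [math.inf, 2]]` and `Q = [[3, 0], [1, math.inf]]`, the entry for the first vertex pair is `min(1 + 3, 4 + 1)`.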


2.3 Using the Hypercube 


Dekel, Nassimi and Sahni [7] have given a matrix multiplication
algorithm for the hypercube. It was shown that two $n \times n$ matrices can
be multiplied in $O(n/p + \log p)$ time when $n^2 p$, $1 \le p \le n$, PE's are
available. Conceptually, they treat the hypercube as a $p \times n \times n$
array. We refer to this virtual configuration as the cube configuration, as
distinguished from the actual physical configuration of the hypercube.

The sets mentioned above are implemented as matrices which are 
then manipulated via matrix operations with suitable transform ma- 
trices. It is difficult to describe hypercube parallel operations in terms 
of atomic PE operations. Our algorithm is written as a sequence of 
matrix operations that are successively applied to the input (the edge 
costs). This algorithmic representation is a straightforward and sim- 
ple way to express a parallel algorithm. The matrix of path costs for 
a merged block of vertices is formed by computing the “product” of 
the constituent path cost matrices, where the “product” operation is 
obtained by replacing the multiplication and addition operations of 
ordinary matrix multiplication with the addition and minimum opera- 
tions, respectively. The hypercube is partitioned into sub-cubes and the 
merging of the path cost matrices is done in parallel for the particular 
block of vertices assigned to each sub-cube. The sub-cubes themselves 
are then coalesced to form larger sub-cubes for the next level of
merging. For the sake of brevity we omit further details of the algorithm,
which can be found in [13]. It can be shown that the overall time
needed is $O(\log^2 n)$ for a hypercube of $O(n^3/\log^2 n)$ PE's using $O(1)$
space per PE (a detailed analysis is given in [13]). Furthermore, the
actual sequence of edges (corresponding to the edit sequence) can be
found in the same time and space by a parallel divide-and-conquer
search through the intermediate path cost matrices.


3 The Longest Common Substring Problem 


A related but distinct measure of string distance is the length of the
longest common substring (LCG) [20]. Let $x$ and $y$ be strings over a
fixed finite alphabet, $|x| = m$, $|y| = n$ and $n \ge m$. (Henceforth we
abbreviate length($x$) = $|x|$.) There is a straightforward sequential method
that is obtained by considering all possible alignments between the two
strings which takes $O(mn)$ time [20]. Subsequently this time bound
was improved to $O((m+n) \log(m+n))$ [15] and finally to the optimal
bound $O(m+n)$ [23] (see also [1]). We give a parallel adaptation of the
straightforward method where the actual substring satisfying the LCG
constraint as well as its length are obtained. On an SIMD hypercube of
size $O(n^2)$, we show that the LCG can be found in $O(\log n)$ time using
$O(1)$ space per processing element. Consider the following definitions:


Definition 1 Given strings $x$ and $r_x$, $r_x$ is said to be a rotation of $x$
if there exist strings $u$ and $v$ such that $r_x = uv$ and $x = vu$.


Definition 2 Given strings $x = x_1 x_2 \ldots x_n$ and $y = y_1 y_2 \ldots y_n$, the
aligned longest common substring (ALCG) of $x$ and $y$ is the LCG
$u$ of $x$ and $y$ with the additional constraint that $u = x_i x_{i+1} \ldots x_k =
y_i y_{i+1} \ldots y_k$ for some $1 \le i \le k \le n$.


Now suppose that we wish to find the LCG of $x$ and $y$, where
$n = |y| \ge |x|$. Let "#" and "$" be two distinct symbols not occurring
in the alphabet of $x$ and $y$. Further suppose that we already have a
function ALCG(u, v) that returns the ALCG of $u$ and $v$ for strings such
that $|u| = |v|$. The following is a reformulation of the straightforward
algorithm using the terms just defined.


input(x, y);
u := v := ε;    //ε is the empty string//
y' := y#;
x' := x$^(|y'| - |x|);
for all rotations r_y' of y' do begin
    u := ALCG(x', r_y');
    if |u| > |v| then
        v := u
end;
output(v);


Definition 3 Given strings $x = x_1 x_2 \ldots x_n$, $y = y_1 y_2 \ldots y_n$ and
$c = c_1 c_2 \ldots c_n$, $c$ is said to be the characteristic matching string of $x$ and $y$
if

\[ c_i = \begin{cases} 1 & \text{if } x_i = y_i, \\ 0 & \text{otherwise.} \end{cases} \]


Now note that the ALCG(x, y) function returns the substring of $x$ and
$y$ corresponding to the longest substring of 1's in the characteristic
matching string of $x$ and $y$.
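The rotation-based formulation and the characteristic matching string can be tried out with a short sequential sketch. The function names `alcg` and `lcg` are hypothetical, the padding symbols are assumed not to occur in either input, and $|y| \ge |x|$ is assumed as in the text.

```python
def alcg(u, v):
    """Aligned longest common substring of two equal-length strings.

    Scans the characteristic matching string (u[i] == v[i]) for its
    longest run of matches and returns the corresponding substring.
    """
    assert len(u) == len(v)
    best_start = best_len = 0
    run_start = run_len = 0
    for i, (a, b) in enumerate(zip(u, v)):
        if a == b:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return u[best_start:best_start + best_len]


def lcg(x, y, pad_y="#", pad_x="$"):
    """Longest common substring of x and y via rotations of y#."""
    assert len(y) >= len(x)
    y1 = y + pad_y                               # '#' blocks wrap-around matches
    x1 = x + pad_x * (len(y1) - len(x))          # '$' never matches anything
    best = ""
    for s in range(len(y1)):                     # every rotation of y'
        rotation = y1[s:] + y1[:s]
        u = alcg(x1, rotation)
        if len(u) > len(best):
            best = u
    return best
```

Each of the $n+1$ rotations costs a linear scan, so this sketch runs in quadratic time, in line with the straightforward sequential method.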

Again suppose that we are given strings $x$ and $y$, $n = |y\#| > |x|$.
Also suppose that $n$ is a power of 2. If $n$ is not a power of 2 then $y$ can
be padded on the right end with additional "#"s to form the string
$y\#\#^{2^{\lceil \log n \rceil} - n}$. From now on we will denote the padded string as $y$. We
pad the right end of $x$ to yield the string $x\$^{2^{\lceil \log n \rceil} - |x|}$. Again, from now
on we will denote the padded string as $x$. We will need a hypercube of
$n^2$ PE's. The hypercube can be conceptualized as a two-dimensional
array where each row and column of the array is itself connected as a
hypercube (e.g., see [4]). In this configuration we can alternately describe
a PE by either its row-major array coordinates $(i,j)$, $1 \le i,j \le n$, or its
binary address in the overall hypercube. The mapping from coordinate
form to binary index form is: PE $(i,j)$ has index $(i-1) \cdot n + (j-1)$.
Initially suppose that $x$ and $y$ are stored in the hypercube such that
PE $(1,j)$ contains both $x_j$ and $y_j$, $1 \le j \le n$. The algorithm can be
divided into four phases:

Phase 1 We wish to store the symbols $x_j$ and $y_j$ in PE $(i,j)$, $1 \le
i,j \le n$. It is well known that a datum can be broadcast to all nodes
of a hypercube of size $n$ in exactly $\log n$ time [7]. Each column of the
conceptual array is a hypercube of size $n$ and $x_j$ and $y_j$ are initially
stored in PE $(1,j)$. Hence, the broadcasting is done in parallel on the
$n$ columns with the result that PE $(i,j)$ holds $x_j$ and $y_j$ after $O(\log n)$
time.

Phase 2 String $y$ is now circularly shifted $(i-1)$ places to the right
along row $i$, $1 \le i \le n$. The result is that $y_j$ is now stored in PE
$(i, ((j + i - 2) \bmod n) + 1)$ in row $i$. This shifting is done on every
row in parallel. Each row of the conceptual array is a hypercube of
size $n$. It has been shown that circular shifting of $k$ positions for any
$0 \le k \le n-1$ can be performed in $O(\log n)$ time on a hypercube of
size $n$ [17, 7]. Hence Phase 2 takes $O(\log n)$ time overall.

Phase 3 In Phase 2, all $n$ possible rotations of $y$ have been generated,
one unique rotation per row. The ALCG will now be computed in
parallel for each row $i$ using $x$ and $r_y^{(i)}$, where $r_y^{(i)}$ is the rotation of $y$
on row $i$. The output of this phase will be the initial index (with respect
to $x$) and length of the ALCG solution for each row. These values will
be stored in PE's $(i,1)$, $1 \le i \le n$, corresponding to the rows. Let us
assume that each PE has stored its column index. We will focus on the
ALCG computation for a particular row $i$ but the following procedure
will be performed in parallel on every row $1 \le i \le n$. Initially the
characteristic matching string $c$ is formed for $x$ and $r_y^{(i)}$, one symbol
per PE. This is easily done in parallel in $O(1)$ time by comparing $x_j$
and the $j$-th symbol of $r_y^{(i)}$ at PE $(i,j)$ and setting some register to 1 if
they are equal, otherwise 0. Therefore, the problem is reduced to finding
the initial index and length of the longest substring of 1's in the string $c$,
which is distributed over the $n$ PE's of row $i$, one symbol per PE. Since this
subproblem is of independent interest we will state it as a lemma.


Lemma 1 Given a string $x$ composed of 0's and 1's, $|x| = n$, the length
of the longest substring of 1's in $x$ can be found in $O(\log n)$ time using
$O(1)$ space per PE on an SIMD hypercube of size $O(n)$. Furthermore,
the initial index of the satisfying substring also can be found in the same
time, space and number of PE's if each PE has stored its hypercube
index. (A proof can be found in [13].)
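One standard way to obtain a logarithmic number of combining rounds for this subproblem is a pairwise reduction over segment summaries. The sketch below simulates such a reduction sequentially. It is not necessarily the construction used in the proof in [13], and the names `Seg`, `leaf`, `combine` and `longest_run` are hypothetical.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    size: int     # number of symbols in the segment
    prefix: int   # length of the run of 1's at the left end
    suffix: int   # length of the run of 1's at the right end
    best: int     # length of the longest run of 1's inside the segment
    start: int    # 0-based initial index of that run within the segment


def leaf(bit):
    one = 1 if bit == "1" else 0
    return Seg(1, one, one, one, 0)


def combine(a, b):
    """Summary of the concatenation of two adjacent segments."""
    prefix = a.prefix if a.prefix < a.size else a.size + b.prefix
    suffix = b.suffix if b.suffix < b.size else b.size + a.suffix
    candidates = [
        (a.best, a.start),                        # run inside the left part
        (a.suffix + b.prefix, a.size - a.suffix), # run crossing the boundary
        (b.best, a.size + b.start),               # run inside the right part
    ]
    # Longest run wins; ties go to the leftmost start.
    best, start = max(candidates, key=lambda c: (c[0], -c[1]))
    return Seg(a.size + b.size, prefix, suffix, best, start)


def longest_run(bits):
    """Return (length, initial index) of the longest run of 1's."""
    if not bits:
        return 0, 0
    level = [leaf(c) for c in bits]
    while len(level) > 1:                         # about log2(n) rounds
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0].best, level[0].start
```

Because `combine` is associative, all pairs in one round are independent, which is what makes a parallel implementation with one summary per PE plausible.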


Therefore, after computing the characteristic matching string, we
can apply Lemma 1 with the result that the initial index and length of
the ALCG of $x$ and $r_y^{(i)}$ are stored in PE $(i,1)$, $1 \le i \le n$, after $O(\log n)$
time.

Phase 4 The overall solution to the LCG problem now is found by
taking the maximum of the length values stored in PE $(i,1)$, $1 \le i \le
n$. The indices corresponding to the lengths are carried along in the
computation. Since this column of PE's is a hypercube, the maximum
can be computed in a straightforward way in $O(\log n)$ time [7].


4 The Substring Problem 


Let $x$ and $y$ be strings over a fixed finite alphabet. The substring
problem asks whether $x$ is a substring of $y$. Typically the string $x$ (called
the pattern) is much shorter than the string $y$ (called the text) [16].
Sequential solutions to the substring problem have been extensively
studied. The straightforward algorithm consists of aligning the pattern
starting at the beginning of the text and comparing symbols pairwise in
order from left to right. If the pattern matches the text for this
alignment then the occurrence is noted. Otherwise, the pattern is shifted
one position to the right and the procedure repeated (see [1]). In the
worst case this approach takes $O(mn)$ time. An optimal algorithm
taking $O(m+n)$ time can be obtained by using the results of the previous
partial match [16]. Some related results and extensions were given in
[2, 10] while a general discussion of this technique can be found in [1].

Parallel algorithms have also been given. On a concurrent-read
concurrent-write parallel random access machine (CRCW PRAM) of
$O(n/\log n)$ processing elements (PE's), a time bound of $O(\log n)$ was
obtained [9, 21]. We will show that on an MIMD hypercube of $O(n/m)$
PE's, the substring problem can be solved in $O(m + \log n)$ time using
$O(m)$ space per PE. This algorithm possesses an optimal processor-time
product if $m = \Omega(\log n)$, i.e., $m$ is asymptotically at least
$\log n$. Our algorithm will use the sequential algorithm given in [16] as
a subroutine. Therefore it is appropriate to consider the MIMD model
since this subroutine can be performed asynchronously on each node in
parallel.


4.1 The Algorithm 


We need an MIMD hypercube of $n/m$ PE's. (Without loss of generality
assume that $m$ evenly divides $n$; if not, then the text $y$ can be padded
on the right with some symbol not occurring in the alphabet of the
pattern $x$ and the text $y$.) Let $x = x_1 x_2 \ldots x_m$ and $y = y_1 y_2 \ldots y_n$.
It is well known that a hypercube can optimally simulate a two-way
one-dimensional array of processors (e.g., see [4]). This technique uses
a Gray code to map the PE's $0, 1, \ldots, n/m - 1$ such that PE $i$ is directly
connected to PE $i-1$ and $i+1$. Initially suppose that the pattern $x$ is
stored in PE 0 and the text $y$ is distributed over the PE's such that PE
$i$ contains $y_{mi+1} y_{mi+2} \ldots y_{m(i+1)}$. The algorithm is divided into four
phases:

Phase 1 We wish to broadcast the pattern x to all other PE’s. A 
technique given in [11] uses an embedding of spanning trees in order to 
obtain the optimal time bound of O(m + log(n/m)) for broadcasting 
m items to all PE’s in a hypercube of size n/m. Each PE stores the 
pattern as it is received. 

Phase 2 After Phase 1 PE $i$ has stored the pattern $x$ and a segment
of the text $y_{mi+1} y_{mi+2} \ldots y_{m(i+1)}$. Now it is possible that the pattern
occurs in the text overlapping a boundary imposed by this
segmentation. This is handled by reading a portion of the text stored in
PE $i+1$ and then searching for the pattern in this enlarged segment.
Hence we want PE $i+1$ to send $y_{m(i+1)+1} y_{m(i+1)+2} \ldots y_{m(i+2)-1}$ to PE
$i$, for all $i+1 \le n/m - 1$ in parallel. Using the one-dimensional array
connections this can be done in $O(m)$ time in parallel.

Phase 3 After Phase 2 PE $i$, $i < n/m - 1$, holds an enlarged segment
of the text $y_{mi+1} y_{mi+2} \ldots y_{m(i+2)-1}$ which is $2m - 1$ symbols long. PE
$n/m - 1$ has stored $y_{n-m+1} y_{n-m+2} \ldots y_n$. In parallel each PE now
searches for $x$ in its respective segment. Using the technique given
in [16] this takes $O(m)$ time. The PE then notes whether or not the
pattern was found.

Phase 4 In O(log(n/m)) time the n/m PE’s can be polled in parallel 
to see if any match occurred [7]. 
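The segmentation with overlap used in Phases 2 to 4 can be simulated sequentially as a sanity check. In this sketch each list element stands for one PE, Python's built-in `in` operator stands in for the linear-time search of [16], and the function name `parallel_substring` and the padding symbol are assumptions made for illustration.

```python
def parallel_substring(x, y, pad="\0"):
    """Simulate the segmented search; pad must not occur in x or y."""
    m = len(x)
    if m == 0:
        return True
    if len(y) % m:                                # pad so that m divides n
        y += pad * (m - len(y) % m)
    p = len(y) // m                               # number of simulated PE's
    segments = [y[i * m:(i + 1) * m] for i in range(p)]
    # Phase 2: PE i receives the first m - 1 symbols held by PE i + 1.
    enlarged = [
        segments[i] + (segments[i + 1][:m - 1] if i + 1 < p else "")
        for i in range(p)
    ]
    # Phase 3: every PE searches its own enlarged segment.
    found = [x in seg for seg in enlarged]
    # Phase 4: poll all PE's.
    return any(found)
```

An occurrence that starts in segment $i$ ends at most $m - 1$ symbols into segment $i + 1$, which is why an overlap of $m - 1$ symbols suffices.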


4.2 Implementation on the NCUBE/7 


While the above algorithm has an optimal processor-time product for 
O(n/m) PE’s, in practice the size of a given hypercube machine is fixed. 
Hence, it is desirable to obtain an algorithm that yields an acceptable 
speed-up using a fixed number of PE’s for a pattern and text of arbi- 
trary size. With slight modifications, our algorithm was implemented 
on the NCUBE/7 hypercube machine using a maximum of 64 PE’s. 
The overall time for a given run can be decomposed into two parts: 
communication and computation. The communication time consists 
of the time needed to broadcast the pattern to all PE’s plus the time 
needed to send and receive the appropriate portions of text between 
neighboring PE’s in the conceptual linear array. The computation time 
is the time needed to search for the pattern within the enlarged seg- 
ment of text. The results are summarized in Figure 4 for a text of 
length 10,000 and a pattern of length 12. Additional experiments were 
performed to consider various ways of loading the text and pattern into 
the hypercube, as well as various searching techniques [14]. 


[Plot of normalized time versus number of PE's. Legend: o = overall,
x = computation, + = communication.]


Figure 4: Normalized timing values for the NCUBE/7 implementation 
of the substring algorithm. 


5 Summary 


We have given parallel algorithms for solving some string compari- 
son problems on the hypercube. The problems include both inexact 
and exact matching problems. Our algorithm for the inexact matching 
problems, of which the string edit problem is the most general
representative, was presented in terms of a shortest path problem for a special
family of graphs. The exact matching problems were solved in a more 
direct fashion. Our algorithms use many different conceptualizations 
of the hypercube architecture. For example, at certain instances it may 
be useful to treat the hypercube as a mesh or array. At other times 
a tree-like structure may be appropriate. The hypercube is very well 
suited for this due to its high connectivity. 

Abstract


We provide a useful technique called the Chain 
Theorem to derive good time lower bounds for sorting on 
the multi-dimensional mesh-connected model. For any 
d>2 we derive a lower bound which is significantly better 
than the distance bound of dn on the d-dimensional 
model. We also distinguish between indexing schemes by 
showing that there exists a poorer indexing scheme than 
the snake-like indexing scheme on the multi-dimensional 
model. All these results are obtained using the chain 
theorem. 


1. Introduction 


A mesh-connected processor array is widely accepted
as a realistic model of a parallel computer. The problem
of sorting on a mesh-connected processor array has been
studied by many researchers [3-7, 9-14]. It is known
that $(2d-1)n$ steps are optimal computing time within the
leading term for sorting $n^d$ items into d-dimensional
snake-like order on the d-dimensional mesh-connected
model [3, 4, 13]. However, up to now we do not know
whether the snake-like order is the best indexing scheme
for sorting. A question whether the distance bound $2n-2$
is ultimately achievable on a two-dimensional $n \times n$
mesh-connected model by using some super indexing scheme has
also been raised [6].


The authors of the present paper have shown that
$2.2247n$ steps are a time lower bound independent of
indexing schemes for sorting $n^2$ items on the $n \times n$
mesh-connected model [1]. Thus, the question posed by Ma et
al. [6] has been answered. Time lower bounds for various
indexing schemes on the $n \times n$ mesh-connected model and
the existence of a poor indexing scheme with $4n - 2\sqrt{2n} - 3$
time lower bound have also been shown [1]. These results
have been obtained using a new technique called the chain
argument [1].


In this paper we develop the chain argument in order
that we can apply its extended version to derive nontrivial
lower bounds for sorting. We show a theorem that gives a
relation between computing time for sorting $n^d$ items and
the number of processors in a certain region of the
mesh-connected model. We can numerically calculate the best
lower bound obtainable from the theorem. For each d>2
our result is significantly better than the distance bound
of dn on the d-dimensional model.


2. Preliminary 


We consider a general model of a synchronous d-dimensional
mesh-connected processor array consisting of
$n^d$ identical processors. It is denoted by $M[1..n, \ldots, 1..n]$
or $M[(1..n)^d]$. Each processor at location $(i_1, \ldots, i_d)$,
$1 \le i_1, \ldots, i_d \le n$, is denoted by $M[i_1, \ldots, i_d]$. The distance
between $M[i_1, \ldots, i_d]$ and $M[j_1, \ldots, j_d]$ is defined to be
$\sum_{k=1}^{d} |i_k - j_k|$ and denoted by $dis((i_1, \ldots, i_d), (j_1, \ldots, j_d))$.
Processor $M[i_1, \ldots, i_d]$ is directly connected with processor
$M[j_1, \ldots, j_d]$ if and only if $dis((i_1, \ldots, i_d), (j_1, \ldots, j_d)) = 1$.
All $n^d$ processors work in parallel with a single clock, but
they may run different programs. As for sorting computation,
the initial contents of $M[(1..n)^d]$ are assumed to be $n^d$
linearly ordered items, where each processor has exactly
one item. The final contents of $M[(1..n)^d]$ are the sorted
sequence of the items in a specific order. In one step each
processor can communicate with all of its directly-connected
neighbor processors. The interchange of items in
a pair of directly-connected processors or the replacement
of the item in a processor with the item in one of its
neighbor processors can be done in one step. The computing
time is defined as the number of parallel steps of the
basic operations to reach the final configuration.


An indexing on processor array $M[(1..n)^d]$ is a one-to-one
mapping from $\{1, \ldots, n\}^d$ to $\{1, \ldots, n^d\}$. For an
indexing $I$, the index of $M[i_1, \ldots, i_d]$ is denoted by
$I(i_1, \ldots, i_d)$. Some indexing schemes on the 2-dimensional
model are shown in Fig. 1.


     1  2  3  4          1  2  3  4
     5  6  7  8          8  7  6  5
     9 10 11 12          9 10 11 12
    13 14 15 16         16 15 14 13
        (a)                 (b)

Fig. 1. (a) Row-major indexing
        (b) Snake-like row-major indexing
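The two indexing schemes of Fig. 1 can be generated for any $n$ with a few lines of code. The function names `row_major` and `snake_like` are hypothetical and the grids are returned as lists of rows.

```python
def row_major(n):
    """Row-major indexing of an n x n mesh, indices 1..n*n."""
    return [[r * n + c + 1 for c in range(n)] for r in range(n)]


def snake_like(n):
    """Snake-like row-major indexing: every second row is reversed."""
    grid = row_major(n)
    return [row if r % 2 == 0 else row[::-1] for r, row in enumerate(grid)]
```

Under snake-like indexing, consecutive indices always sit on directly connected processors, which is not the case for plain row-major indexing at the end of each row.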


A subset of $M[(1..n)^d]$ is called a region. The distance
between processor $M[i_1, \ldots, i_d]$ and region $S$ is defined by
$\min\{dis((i_1, \ldots, i_d), (j_1, \ldots, j_d)) \mid M[j_1, \ldots, j_d] \text{ is in } S\}$ and
denoted by $dis((i_1, \ldots, i_d), S)$. A sequence
$\langle (i_{11}, \ldots, i_{1d}), \ldots, (i_{c1}, \ldots, i_{cd}) \rangle$ is called a chain under
indexing $I$ if and only if $\langle I(i_{11}, \ldots, i_{1d}), \ldots, I(i_{c1}, \ldots, i_{cd}) \rangle$ is a
consecutive integer sequence in increasing order. For the
above chain its length is $c-1$. Processor $M[i_1, \ldots, i_d]$ is
called a corner if and only if for each $j$ ($1 \le j \le d$) $i_j$ is 1 or
$n$. For an integer $i$ we denote integer $n-i+1$ by $\bar{i}$. If
$M[i_1, \ldots, i_d]$ is a corner and $k$ is a positive integer,
$\{M[j_1, \ldots, j_d] \mid dis((i_1, \ldots, i_d), (j_1, \ldots, j_d)) < k\}$ is called a
corner region and denoted by $CREG((i_1, \ldots, i_d), k)$. If $S$ is
a region, the cardinality of $S$ is defined as the number of
processors in $S$ and denoted by $\|S\|$. An ordered pair of
corner regions $CREG((i_1, \ldots, i_d), k_1)$ and
$CREG((\bar{i}_1, \ldots, \bar{i}_d), k_2)$ is called a sweep. The first corner
region of the sweep is called the residing region and the
second one is called the stretching region. The length of
the sweep is defined to be $d(n-1) + k_1 - k_2$.


3. The Chain Theorem 


Sweeps play an important role in deriving time lower 
bounds for sorting on the mesh-connected model. Our first 
theorem gives a relation among the computing time, the 
length of a sweep and the length of a chain. 


Theorem 1 (Chain Theorem): For an indexing $I$ on the
d-dimensional mesh-connected model and a sweep of
length $T$, if there is no chain in its residing region such
that its length is equal to the cardinality of the stretching
region, then there is no algorithm of time complexity less
than $T$ for sorting $n^d$ items on the model into the order
specified by $I$.


Proof: Let $(CREG((i_1, \ldots, i_d), k_1),\ CREG((\bar{i}_1, \ldots, \bar{i}_d), k_2))$ be
a sweep, where each $i_j$ ($1 \le j \le d$) is 1 or $n$. Then the length
of the sweep is $T = d(n-1) + k_1 - k_2$. Let $S$ be the cardinality
of the stretching region. Suppose that an algorithm of
time complexity $T-1$ is executed on the d-dimensional
mesh-connected model. The effect of the initial contents of
the stretching region on corner $M[i_1, \ldots, i_d]$ does not
appear before the $((n-1)d - k_2 + 1)$-st step of the computation.
Let $a$ be the item in $M[i_1, \ldots, i_d]$ immediately after the
$((n-1)d - k_2)$-th step of the computation. The destination
of item $a$ depends on the initial contents of the stretching
region. By assigning different initial values to the processors
in the stretching region, we can force item $a$ into $S+1$
different sorted positions. These different positions form a
chain of length $S$, and should be within the residing region
since the computing time is $T-1$ and only $k_1 - 1$ steps remain.
This contradicts the assumption that no such chain exists. □


Kunde [3] has shown a time lower bound for sorting in
lexicographic order on the multi-dimensional mesh-connected
model. We first show how to use the Chain
Theorem by applying it to derive Kunde's lower bound.


Lemma 1: Let $R = \{(i_1, \ldots, i_d) \mid$ for each $j$ ($1 \le j \le d$) $i_j$
is a positive integer and $\sum_{j=1}^{d} (i_j - 1) < k\}$. Then
$k^d/d! \le \|R\| \le (k+d-1)^d/d!$, where $\|R\|$ is the cardinality
of $R$.

Proof: $\|R\|$ satisfies
\[
\int_0^{k} dx_1 \int_0^{k-x_1} dx_2 \cdots \int_0^{k-\sum_{j=1}^{d-1} x_j} dx_d
\;\le\; \|R\| \;\le\;
\int_0^{k+d-1} dx_1 \int_0^{k+d-1-x_1} dx_2 \cdots \int_0^{k+d-1-\sum_{j=1}^{d-1} x_j} dx_d .
\]
Therefore, $k^d/d! \le \|R\| \le (k+d-1)^d/d!$. □

Corollary 1: Let $M[i_1, \ldots, i_d]$ be a corner and $V$ be the
cardinality of $CREG((i_1, \ldots, i_d), k)$. If $1 \le k \le n$, then
$k^d/d! \le V \le (k+d-1)^d/d!$.
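For small meshes the bounds of Corollary 1 can be checked by brute-force enumeration. The sketch below assumes that a corner region contains the processors at distance strictly less than $k$ from the corner; the helper names `corner_region_size` and `corollary_bounds` are hypothetical.

```python
from itertools import product
from math import factorial


def corner_region_size(n, corner, k):
    """Number of processors of the n^d mesh at distance < k from corner."""
    d = len(corner)
    return sum(
        1
        for p in product(range(1, n + 1), repeat=d)
        if sum(abs(a - b) for a, b in zip(corner, p)) < k
    )


def corollary_bounds(d, k):
    """Lower and upper bounds k^d/d! and (k+d-1)^d/d! of Corollary 1."""
    return k ** d / factorial(d), (k + d - 1) ** d / factorial(d)
```

For example, in a 5 x 5 mesh the corner region around (1, 1) with $k = 3$ has 6 processors, which lies between the bounds 4.5 and 8.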


Theorem 2: A time lower bound for sorting $n^d$ items on
the d-dimensional mesh-connected model into lexicographic
order is $(2d-1)n - \lceil (d!\, n^{d-1})^{1/d} \rceil - 2d + 2$ steps.
Proof: Consider the sweep
$(CREG((1, n, \ldots, n), (d-1)n - d + 2),\ CREG((n, 1, \ldots, 1),
\lceil (d!\, n^{d-1})^{1/d} \rceil))$. The length of the sweep is
$(2d-1)n - \lceil (d!\, n^{d-1})^{1/d} \rceil - 2d + 2$. From Corollary 1 the
cardinality of the stretching region is greater than $n^{d-1}$. The
length of the longest chain in the residing region is
$n^{d-1} - 1$. Therefore, from the Chain Theorem any algorithm
for sorting $n^d$ items on the model into lexicographic
order takes at least $(2d-1)n - \lceil (d!\, n^{d-1})^{1/d} \rceil - 2d + 2$
steps. □


The above lower bound is also a lower bound for the 
snake-like order. 


4. A Poor Indexing Scheme 


Kunde [4] has shown that within the leading term,
$(2d-1)n$ steps are the asymptotically optimal computing
time for sorting $n^d$ items into snake-like order. We show
the existence of a poorer indexing scheme than the snake-like
indexing.

Theorem 3: There exists an indexing scheme such that
any algorithm for sorting $n^d$ items by the indexing scheme
on the d-dimensional mesh-connected model takes at least
$2dn - 2\lceil (d!)^{1/d} n^{1/2} \rceil - 2d + 1$ steps.

Proof: Let $k = \lceil (d!)^{1/d} n^{1/2} \rceil$. Consider the sweep
$(CREG((1, \ldots, 1), d(n-1) - k + 1),\ CREG((n, \ldots, n), k))$. The
length of the sweep is $2dn - 2\lceil (d!)^{1/d} n^{1/2} \rceil - 2d + 1$. From
Corollary 1 the cardinality of the stretching region is not
smaller than $\lceil n^{d/2} \rceil$. We define an indexing scheme as follows:
The first $\lceil n^{d/2} \rceil$ sorted positions are in the residing
region, the $(\lceil n^{d/2} \rceil + 1)$-st sorted position is in the stretching
region, the next $\lceil n^{d/2} \rceil$ sorted positions are in the
residing region, the $(2\lceil n^{d/2} \rceil + 2)$-nd sorted position is in the
stretching region, and so on. Then the length of the longest
chain in the residing region is $\lceil n^{d/2} \rceil - 1$. This length is
smaller than the cardinality of the stretching region.
Therefore, from the Chain Theorem this theorem holds. □


5. Cardinality of Various Regions 


In this section we describe how cardinalities of corner
regions, center regions and unions of corner regions are
evaluated. The union of k-corner regions on the d-dimensional
model is defined by
$\bigcup_{(i_1, \ldots, i_d) \in \{1, n\}^d} CREG((i_1, \ldots, i_d), k)$ and denoted by
$UCREG(k, d)$. The r-center region of the d-dimensional
model is denoted by $CENT(r, d)$ and defined as follows:
$CENT(0, d)$ is the empty set. If $n$ is even, $CENT(1, d)$ is
$\{M[i_1, \ldots, i_d] \mid$ for each $j$ ($1 \le j \le d$) $i_j$ is $n/2$ or $n/2+1\}$,
and otherwise $CENT(1, d)$ is $\{M[\lceil n/2 \rceil, \ldots, \lceil n/2 \rceil]\}$. For
$r \ge 2$, $CENT(r, d) = \{M[i_1, \ldots, i_d] \mid dis((i_1, \ldots, i_d),
CENT(1, d)) < r\}$.

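Lemma 2: Let $tn \le k \le (t+1)n$, $0 \le t \le d-1$. Then for each
$(c_1, \ldots, c_d)$ in $\{1, n\}^d$, the following inequalities hold:

\[
\frac{k^d}{d!} - \frac{(k-n)^d}{(d-1)!} + \frac{(k-2n)^d}{2!(d-2)!} - \cdots + (-1)^t \frac{(k-tn)^d}{t!(d-t)!}
\;\le\; \|CREG((c_1, \ldots, c_d), k)\|
\]
\[
\le\; \frac{(k+d-1)^d}{d!} - \frac{(k-n+d-1)^d}{(d-1)!} + \frac{(k-2n+d-1)^d}{2!(d-2)!} - \cdots + (-1)^t \frac{(k-tn+d-1)^d}{t!(d-t)!}.
\]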


Proof: Let us first evaluate the volume of a region on the
d-dimensional real space. Let $R(k)$ be $\{(i_1, \ldots, i_d) \mid$ for
each $j$ ($1 \le j \le d$) $i_j$ is a positive real number, and
$\sum_{j=1}^{d} i_j \le k\}$, and let $V(k)$ be $\{(i_1, \ldots, i_d) \mid$ for each
$j$ ($1 \le j \le d$) $i_j$ is a positive real number not greater than
$n$, and $\sum_{j=1}^{d} i_j \le k\}$. If $S$ is a region on the d-dimensional
real space, the volume of $S$ is denoted by $[|S|]$. Then

\[
[|R(k)|] = \int_0^{k} dx_1 \int_0^{k-x_1} dx_2 \cdots \int_0^{k-\sum_{j=1}^{d-1} x_j} dx_d = \frac{k^d}{d!},
\]

and $[|V(k)|] \le \|CREG((c_1, \ldots, c_d), k)\| \le [|V(k+d-1)|]$. Let
$\{a_1, \ldots, a_d\}$ be the set of properties on elements in
$R(k)$, where $a_i$ is the property that the value of the $i$-th
coordinate is greater than $n$. Let $R(a_i, k)$ be the subregion
of $R(k)$ having property $a_i$, and let $N(a_i)$ be the
volume of $R(a_i, k)$. We also define $R(a_i', k)$ as the subregion
of $R(k)$ not having property $a_i$ and $N(a_i')$ as the
volume of $R(a_i', k)$. Since an element in $R(k)$ can have
more than one property, we also use the following notations:
$N(a_{i_1}, \ldots, a_{i_t})$ is the volume of the subregion of $R(k)$
having properties $a_{i_1}, \ldots, a_{i_t}$ and $N(a_{i_1}', \ldots, a_{i_t}')$ is the volume
of the subregion of $R(k)$ not having properties $a_{i_1}, \ldots, a_{i_t}$.
Then $[|V(k)|] = N(a_1', \ldots, a_d')$. From the principle of
inclusion and exclusion [8],

\[
[|V(k)|] = [|R(k)|] - \sum N(a_{i_1}) + \sum N(a_{i_1}, a_{i_2}) - \sum N(a_{i_1}, a_{i_2}, a_{i_3}) + \cdots + (-1)^d N(a_1, a_2, \ldots, a_d),
\]

where the sum $\sum N(a_{i_1}, \ldots, a_{i_t})$
is taken over all combinations of $t$ properties. If $k < tn$,
then all the terms after the $t$-th term in the right hand
side of the above formula are 0. By a simple integration
as the one in Lemma 1, we have $N(a_{i_1}, \ldots, a_{i_t}) = (k-tn)^d/d!$ if
$k \ge tn$. Hence

\[
[|V(k)|] = \frac{k^d}{d!} - \binom{d}{1}\frac{(k-n)^d}{d!} + \binom{d}{2}\frac{(k-2n)^d}{d!} - \cdots + (-1)^t \binom{d}{t}\frac{(k-tn)^d}{d!}
\]
\[
= \frac{k^d}{d!} - \frac{(k-n)^d}{(d-1)!} + \frac{(k-2n)^d}{2!(d-2)!} - \cdots + (-1)^t \frac{(k-tn)^d}{t!(d-t)!},
\]

where $k \ge tn$. Therefore the lemma holds. □

We consider that each element in $UCREG(k, d)$
belongs to its nearest corner region. That is, we define
$DCREG((c_1, \ldots, c_d), k)$ to be the set
$\{M[i_1, \ldots, i_d] \mid dis((c_1, \ldots, c_d), (i_1, \ldots, i_d)) < k$ and
for each $j$ ($1 \le j \le d$) $|c_j - i_j| \le \lceil n/2 \rceil - 1\}$. If $n$ is even,
$UCREG(k, d) = \bigcup_{(c_1, \ldots, c_d) \in \{1, n\}^d} DCREG((c_1, \ldots, c_d), k)$. If $n$
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is odd, 
DOREG(((cy,..-; Ca), &)- 
(c,..., ege{i, n}4 
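Lemma 3: If $t\lceil n/2\rceil < k \le (t+1)\lceil n/2\rceil$ and $0\le t\le d-1$, then

$$\frac{(2k)^d}{d!} - \frac{(2k-n)^d}{(d-1)!} + \frac{(2k-2n)^d}{2!\,(d-2)!} - \cdots + (-1)^t\frac{(2k-tn)^d}{t!\,(d-t)!}$$

$$\le \|UCREG(k, d)\|$$

$$\le \frac{(2(k+d-1))^d}{d!} - \frac{(2(k+d-1)-n)^d}{(d-1)!} + \frac{(2(k+d-1)-2n)^d}{2!\,(d-2)!} - \cdots + (-1)^t\frac{(2(k+d-1)-tn)^d}{t!\,(d-t)!}.$$
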
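Lemma 4: If $(d-t-1)\lceil n/2\rceil < k \le (d-t)\lceil n/2\rceil$ and
$0\le t\le d-1$, then

$$n^d - \frac{(dn-2k)^d}{d!} + \frac{((d-1)n-2k)^d}{(d-1)!} - \frac{((d-2)n-2k)^d}{2!\,(d-2)!} + \cdots + (-1)^{t+1}\frac{((d-t)n-2k)^d}{t!\,(d-t)!}$$

$$\le \|UCREG(k, d)\|$$

$$\le n^d - \frac{(dn-2(k+d-1))^d}{d!} + \frac{((d-1)n-2(k+d-1))^d}{(d-1)!} - \frac{((d-2)n-2(k+d-1))^d}{2!\,(d-2)!} + \cdots + (-1)^{t+1}\frac{((d-t)n-2(k+d-1))^d}{t!\,(d-t)!}.$$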

The proofs of Lemmas 3 and 4 are similar to that of
Lemma 2 [2]. The values of the formulae bounding
$\|UCREG(k, d)\|$ in Lemma 3 are exactly the same as the
values of the formulae in Lemma 4. If $k<dn/4$ then the
evaluation of $\|UCREG(k, d)\|$ by Lemma 3 is easier than
by Lemma 4, and otherwise the evaluation by Lemma 4 is
easier than by Lemma 3.


6. A Lower Bound for an Arbitrary Sorting Order 


In this section we derive a time lower bound indepen- 
dent of indexing schemes on the d-dimensional model. 


Theorem 4: Let $V=\|UCREG(k_1, d)\|$ and
$\|CREG((1,\ldots,1), k_2)\| \ge n^d-\lceil V/2\rceil$, where
$1\le k_1, k_2\le (n-1)d+1$. Then a lower bound for sorting $n^d$
items by any indexing scheme on the d-dimensional mesh-
connected model is $2d(n-1)-k_1-k_2+1$ steps.


Proof: We consider an arbitrary indexing $I$ on the d-
dimensional mesh-connected model. There exists a position $b$
(i.e., a processor) in $UCREG(k_1,d)$ such that
$\lceil V/2\rceil \le I(b) \le n^d-\lfloor V/2\rfloor+1$ or $\lfloor V/2\rfloor \le I(b) \le n^d-\lceil V/2\rceil+1$.
Such a position $b$ is in at least one corner region
$CREG((i_1,\ldots,i_d), k_1)$. Without loss of generality we may
assume that $b$ is in $CREG((n,\ldots,n), k_1)$. Consider the
sweep $(CREG((1,\ldots,1), (n-1)d-k_1+1)$,
$CREG((n,\ldots,n), k_1))$. Since $b$ is outside of the residing
region, the length of the longest chain in the residing
region is at most $n^d-\lceil V/2\rceil-1$. Since the cardinality of
the stretching region is not less than $n^d-\lceil V/2\rceil$, from the
Chain Theorem a lower bound for sorting $n^d$ items by
indexing $I$ on the model is $2d(n-1)-k_1-k_2+1$ steps. □


Good time lower bounds for arbitrary sorting order
can be obtained from Theorem 4 by minimizing the value
of $k_1+k_2$. Although it is difficult to give a general formula
of the maximized lower bound as a function of $n$ and $d$,
we can numerically derive it for an arbitrary $d$.

Proposition 1: A time lower bound for sorting $n^d$ items
into an arbitrary order on the d-dimensional mesh-
connected model is $\lfloor(d-0.5+(0.5)^{1/d})n\rfloor-d$ steps.


Proof: Let $k_1$ be $\lfloor n/2\rfloor$. Then $2^d$ corner regions
$CREG((i_1,\ldots,i_d), k_1)$ are mutually disjoint. From
Lemma 1, $V=\|UCREG(k_1, d)\|$ is not less than $\lfloor n^d/d!\rfloor$.
Let $k_2$ be $d(n-1)+1-\lfloor(1/2)^{1/d}n\rfloor$. Then the cardinality of
corner region $CREG((i_1,\ldots,i_d), k_2)$ is not less than
$n^d-\lceil V/2\rceil$. Hence, from Theorem 4, a lower bound for
sorting $n^d$ items by any indexing scheme on the d-
dimensional mesh-connected model is
$\lfloor(d-0.5+(0.5)^{1/d})n\rfloor-d$ steps. □

Since we are mainly interested in asymptotic lower 
bounds, we hereafter omit minor terms, ceilings and floors 
in formulae of lower bounds. 


From Lemma 3,
$V=\|UCREG(n, d)\| \ge ((2n)^d-dn^d)/d!$. If the cardinality
of $CREG((i_1,\ldots,i_d), r)$ is not less than $n^d-V/2$, from
Theorem 4 a lower bound for sorting $n^d$ items on the d-
dimensional model is $(2d-1)n-r$ steps. In this case there
exists such $r$ in the range between $(d-1)n$ and $(d-2)n$.
Let $r=(d-1-t)n$, where $0\le t<1$. Then
$\|CREG((i_1,\ldots,i_d), r)\| \ge n^d-V/2$ if and only if
$(1+t)^d-dt^d \le (2^d-d)/2$. Let $t+1=((2^d-d)/2)^{1/d}$. Then
$\|CREG((i_1,\ldots,i_d), r)\| \ge n^d-V/2$. Therefore,
$(((2^d-d)/2)^{1/d}+d-1)n$ is a time lower bound for sorting
$n^d$ items on the model. Hence we have the next proposition.

Proposition 2: A time lower bound for sorting $n^d$ items
into an arbitrary order on the d-dimensional mesh-
connected model is $(((2^d-d)/2)^{1/d}+d-1)n$ steps.


Time lower bounds given by Proposition 2 are listed
in Table 1. These lower bounds are not the best ones
obtainable from Theorem 4.


  d    time lower bound
  2    2.0000n
  3    3.3572n
  4    4.5650n
  5    5.6829n
  6    6.7528n
  7    7.7969n
  8    8.8267n

Table 1: Asymptotic lower bounds obtained by Proposition 2.

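The coefficients in Table 1 follow directly from the closed form in Proposition 2. The short sketch below (the function name `prop2_coefficient` is ours, not the paper's) evaluates $((2^d-d)/2)^{1/d}+d-1$ for $d = 2,\ldots,8$; the printed values should agree with the table up to rounding in the last digit.

```python
def prop2_coefficient(d: int) -> float:
    """Coefficient c(d) such that Proposition 2 gives a lower bound of c(d) * n steps."""
    return ((2 ** d - d) / 2) ** (1 / d) + d - 1


if __name__ == "__main__":
    for d in range(2, 9):
        print(f"d = {d}: {prop2_coefficient(d):.4f}n")
```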
For $d=2$ we choose $k_1=(1-\sqrt{6}/6)n$ and
$k_2=(2-\sqrt{6}/3)n$. Then $\|CREG((i_1,\ldots,i_d), k_2)\| \ge
n^2-\|UCREG(k_1, d)\|/2$. Therefore, from Theorem 4,
$4n-k_1-k_2\approx 2.2247n$ is a time lower bound [11].

Theorem 5: An asymptotic time lower bound for sorting
$n^2$ items into any sorting order on the 2-dimensional
mesh-connected model is $(1+\sqrt{6}/2)n \approx 2.2247n$ steps.
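As a worked check of the arithmetic behind Theorem 5 (dropping the minor terms of the bound $2d(n-1)-k_1-k_2+1$ in Theorem 4, as the text does for asymptotic bounds), the stated choices of $k_1$ and $k_2$ give:

$$4n - k_1 - k_2 = 4n - \left(1-\frac{\sqrt{6}}{6}\right)n - \left(2-\frac{\sqrt{6}}{3}\right)n = \left(1+\frac{\sqrt{6}}{6}+\frac{\sqrt{6}}{3}\right)n = \left(1+\frac{\sqrt{6}}{2}\right)n \approx 2.2247n.$$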


For $d=3$, let $k_1=0.87n$ and $k_2=1.7294n$; for $d=4$, let
$k_1=1.12n$ and $k_2=2.2667n$; for $d=5$, let $k_1=1.395n$ and
$k_2=2.7893n$. From Theorem 4 we have the following
theorems.

Theorem 6: An asymptotic time lower bound for sorting
$n^3$ items into any sorting order on the 3-dimensional
mesh-connected model is $3.4086n$ steps.

Theorem 7: An asymptotic time lower bound for sorting
$n^4$ items into any sorting order on the 4-dimensional
mesh-connected model is $4.6133n$ steps.

Theorem 8: An asymptotic time lower bound for sorting
$n^5$ items into any sorting order on the 5-dimensional
mesh-connected model is $5.8207n$ steps.

Lower bounds listed in Theorems 6, 7, and 8 are the best
ones obtainable from Theorem 4. The proofs of these
theorems are given in [2]. For each $d$ we can derive a
time lower bound better than the lower bound given in
Table 1.


7. Concluding Remarks 


The question whether there exists a sorting algorithm
with less than $(2d-1)n$ time for some indexing scheme still
remains open. The snake-like row-major order and its
trivial variations are the only known indexing schemes for
which optimal sorting algorithms within the leading term
have been found. We are interested in finding optimal
sorting algorithms for other indexing schemes on the
multi-dimensional mesh-connected model.


Abstract


Loosely coupled parallel systems are an appealing architecture for in- 
creased computing power, because of the comparative simplicity of co- 
ordinating individual networked processors to work on a single problem. 
However, the appeal of such systems for many search problems is off- 
set by their marked failure to achieve linear speedup as the degree of 
parallelism grows. 

Here, the overheads (losses) that result in sub-linear speedup for search 
are examined in light of the vertex cover problem. Although this task is 
less complicated than other combinatorial search problems, such as chess, 
it exhibits similar overheads when solved using loosely coupled parallel 
systems. Such overheads arise because pruning causes the search trees 
to become skewed, which in turn makes it difficult to schedule work to 
keep all processors productive. 

The combined overheads comprise solution time overrun, the amount 
by which a solution time is slower than linear speedup. We present 
and discuss experimental data in an attempt to account for solution time 
overrun and communication and synchronization losses, as they occur in 
loosely coupled parallel search. 


Acknowledgements. Financial support from the Canadian Natural 
Sciences and Engineering Research Council through Grant A7902 made 
the experimental work possible. 


1 Introduction 


Using a loosely coupled network of processors is a simple way to in- 
crease the processing power available to an application. This simplicity 
is appealing, especially given the increasing number of computing facili- 
ties that feature networked workstations. Any such facility is a candidate 
for becoming a loosely coupled parallel system, with the addition of soft- 
ware to control processor communication and scheduling. Thus loosely 
coupled parallel systems have the potential to be a comparatively simple, 
low-cost answer to the demand for increased computing power. 

However, for important problems such as game tree search, such sys- 
tems can exhibit heavy computational overheads that undermine their 
efficiency and limit their effective speed. Our primary motivation for 
this study was the investigation of a particular type of overhead, termed 
synchronization loss (defined in Section 2), which can severely impair 
the performance of computer chess programs. Control of this overhead 
is made especially difficult in the chess case by the complexity of the 
program itself and the difficulty in subdividing the work into smaller 
chunks that can be distributed equitably across all processors. Therefore 
we sought a simpler application whose parallel implementation exhibited
a similar serious synchronization overhead. We chose the vertex cover 
problem, which may be stated as follows: Given an undirected graph, 
find the smallest set of vertices such that every edge in the graph is 
incident to at least one vertex in the set [5]. 


2 Types of Overheads 
Overheads in loosely coupled parallel search fall into three broad
categories:


*Present address: Computer Science Department, Carnegie Mellon University,
Pittsburgh PA, 15213-3890.


1. Communication overhead, in which processors wait while they exchange
information that may improve their efficiency. Typically this
exchange involves updating or retrieving data from a global shared 
table, or sending and receiving messages. 


2. Search overhead, in which more nodes are searched in the parallel 
implementation than in a sequential one. One form of search over- 
head occurs when processors unwittingly do redundant calculations. 
Because work is not being done in the strictly sequential style of a 
single processor, information must be shared if processors are to de- 
tect and avoid the duplication of work; redundant calculations arise 
when information that should be shared does not arrive in time to 
prevent the initiation of extra work. Another form of search over- 
head occurs when work is deliberately assigned to processors on a 
speculative basis, in the absence of anything better for them to do. 
Here duplicated or unnecessary effort is a direct trade-off against 
idleness. 


3. Synchronization overhead, in which processors become idle after 
completing their assigned work, and cannot continue until some 
(even all) others finish completely. Clearly, if processors are idle 
due to poor synchronization, then overall speed of the system is far 
less than optimal. 


Normally there is a trade-off between communication, search, and syn- 
chronization overheads. For example, if speculative computing is em- 
ployed to reduce synchronization loss, there will normally be some in- 
crease in search overhead, and perhaps also extra communication. 


3 Effective Power of Parallel Solutions 


The power of parallel solutions is often demonstrated via the solution of 
classical combinatorial problems [13]. Especially popular is the traveling 
salesman problem, since all combinations might have to be searched and 
hence nearly linear speedup is possible [9]. Almost all combinatorial 
search problems are well-suited to a multiprocessor solution, provided 
that most work is independent (i.e., negligible data sharing is needed and 
subproblems can be solved in any order). In the ideal case, subproblems 
are also predictable in size, allowing nearly perfect processor scheduling. 

If search tree pruning techniques are used in a sequential solution, 
then in parallel adaptations of that solution processors must share infor- 
mation that leads to cutoffs. Information-sharing entails communication 
cost. More importantly, pruning in a parallel solution introduces syn- 
chronization overhead because it renders the size of a chunk of work 
unpredictable. Thus pruning, while improving sequential solutions, con- 
tributes to overheads in parallel solutions, making ideal (linear) speedup 
over the best available sequential solution difficult to achieve. 

The effective power of a parallel solution may be overestimated if 
the uniprocessor solution against which it is compared is not the fastest 
available. Consider sequential minimax search of a uniform tree (i.e., 
exhaustive search of a decision tree of fixed depth and constant branching 
factor). Simple tree-splitting will yield close to ideal speedups. There 
is no pruning, so there is no need for communication, and subproblems 
are of uniform size. However, in current parallel implementations of the 
α-β pruning algorithm [10], tree-splitting clearly fails to achieve linear
speedup. 

Given that pruning effects are equally desirable in a parallel solution, 
methods must be devised to control the overheads that stem from them. 


One such method is to dynamically reconfigure communication paths so 
that idle processors can be assigned to those that are still busy [12]. 
An alternative that is appealing for the relative simplicity of its control
structure is to tailor a static processor configuration to the application at 
hand. 


4 Parallel Solutions to Vertex Cover 


A multiprocessor solution to vertex cover was one of several potential 
applications for the MANIP architecture proposed by Wah and Ma [14]. 
A follow-up study by Zariffa [15] used a 7 processor system to gain 
working experience with some pragmatic aspects of multicomputer sys- 
tems. Sufficient implementation details were provided in that study for 
us to replicate its experiments, and thereby compare computer systems 
and processor configurations. This paper presents data gathered in the 
course of our replication. 

In the original experiments, 15 graphs were searched with 2, 4, and 
7 processors, statically configured, and scheduled using three different 
schemes. Of the scheduling methods, dynamic first level (dfl) proved 
the most effective, and is the method on which our study is based. Dfl 
assigns a first-level tree node to each processor, then dynamically assigns 
the remaining ones on a first-come, first-serve basis. 
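The dfl policy described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the scheduler used in either study: the function name `dfl_schedule` is hypothetical, each first-level subtree is given a fixed cost known in advance, and the effect of pruning bounds on subtree cost is ignored.

```python
import heapq


def dfl_schedule(chunk_times, num_procs):
    """Simulate dynamic first-level scheduling with fixed chunk costs.

    Chunks (first-level subtrees) are handed out in order to whichever
    processor becomes free first. Returns (elapsed, idle_per_processor).
    """
    # Heap of (time at which the processor becomes free, processor id).
    free_at = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(free_at)
    busy = [0.0] * num_procs
    for cost in chunk_times:
        now, p = heapq.heappop(free_at)  # first processor to ask for work
        busy[p] += cost
        heapq.heappush(free_at, (now + cost, p))
    elapsed = max(t for t, _ in free_at)
    idle = [elapsed - b for b in busy]
    return elapsed, idle


if __name__ == "__main__":
    # One large subtree and several small ones: a skewed tree.
    print(dfl_schedule([8.0, 1.0, 1.0, 1.0, 1.0], 2))
```

With a skewed set of subtree costs, one processor is tied up with the large subtree while the other runs out of work and sits idle, which is the synchronization loss discussed in Section 2.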

The original search algorithm prunes branches whose solutions cannot 
better the best found so far, and therefore builds trees similar to chess 
trees in that they are somewhat skewed to the left. We devised a faster 
algorithm that builds smaller and highly skewed trees [1], but used the 
original one in our experiments to allow comparison of results. The 
existence of this faster sequential algorithm implies that our speedups 
do not necessarily reflect the effective power of parallel vertex cover 
solutions. 
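For readers unfamiliar with the problem, the pruning rule mentioned above (abandon a branch that cannot better the best cover found so far) can be sketched as a small sequential branch-and-bound search. This is an illustrative sketch only; it is neither the original algorithm from [15] nor the faster algorithm from [1], and the function name `min_vertex_cover` is hypothetical.

```python
def min_vertex_cover(edges):
    """Return a minimum vertex cover of an undirected graph given as edge pairs."""
    best = {v for e in edges for v in e}  # trivial cover: every endpoint

    def search(remaining, chosen):
        nonlocal best
        if len(chosen) >= len(best):
            return  # prune: this branch cannot better the best cover so far
        if not remaining:
            best = set(chosen)  # all edges covered by a strictly smaller set
            return
        u, v = remaining[0]
        for pick in (u, v):  # one endpoint of an uncovered edge must be chosen
            rest = [e for e in remaining if pick not in e]
            search(rest, chosen | {pick})

    search(list(edges), frozenset())
    return best


if __name__ == "__main__":
    star = [(0, 1), (0, 2), (0, 3)]
    print(min_vertex_cover(star))
```

Because the bound tightens as better covers are found, branches explored later tend to be cut off sooner, which is one way such search trees become skewed.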

We note that Zariffa’s data on the occurrence of acceleration anoma- 
lies [6,4] is inconsistent. In our own experiments acceleration anomalies 
were not observed, although they do occur in game trees [8]. 


5 Implementation Details 


In this section we present implementation factors that bear on the anal- 
ysis that follows in Section 6. 


5.1 Non-homogeneous Processor Systems


The hardware in Zariffa’s system consisted of Data General Nova 4s
and one Nova 3. The system was non-homogeneous, the Nova 3 being
roughly 10 percent faster than the others. However, because the Nova 3
provided disk access for the system, it participated in all experiments.

Our system had an analogous feature: six Motorola 68010s, running
with limited operating system support, were roughly 20 percent faster
than the seventh, which was identical but ran under UNIX¹ and provided
full multitasking. The UNIX-based processor (named sunshine)
was also the only one that provided disk access for the system. To reduce
operating system overhead to a minimum, the kernel of the other
processors (referred to as standalones) supported Ethernet access as the
only I/O function, and did not support multiple application processes.
Sunshine necessarily participated in all the experiments, but usually only
as a master data gatherer.


5.2 Processor Configurations 


The topology of communication paths in our system (henceforth the 
processor configuration) was different than that in the original work, 
where constraints imposed by the hardware dictated that a particular ring 
configuration was the simplest and most effective. 

In this form of ring, a record of control information, such as a list of 
completed chunks of work, is maintained by each processor. When the 
record changes, as upon the selection of a new chunk by a processor, 
the updated record is passed around the ring. Because P copies of the 
control information are maintained, updating the information requires that
a packet travel P − 1 successive hops before the update is complete. In the
original experiments, an update was delayed at least 10 milliseconds (ms)
before reaching the last processor. Thus it was possible for two nodes to
claim the same chunk of new work within the period of communication
latency, since they used their own copy of the control information when
looking for work [15].


¹UNIX is a registered trademark of Bell Laboratories.

A broadcast bus (Ethernet) and our Virtual Tree Machine (VTM) [11]
allowed us greater flexibility in designing a processor configuration. We 
used a single-level process tree, in which slave processes executed the 
search, and spoke only with a master process. (Although arbitrary and 
dynamic configurations are possible with the VTM, such static processor 
trees remain the most common [11].) The slave processes resided on the 
standalones, and the master, responsible for file I/O, resided on sunshine. 
The master also coordinated the search, maintaining the unique copy of 
the control record. The advantages of such a configuration are twofold: 


1. Faster broadcasting of updates. Although information must still 
travel P hops for an update (1 slave-to-master message and P — 1 
master-to-slave messages), the processor configuration imposes no 
sequential ordering on messages. The master queues all outgoing 
messages in a tight loop, without waiting for any particular one to be 
sent. Thus the P — 1 messages are broadcast nearly simultaneously. 


2. No possibility of duplicating a chunk of work. All reads and writes 
of search control information are executed sequentially on a single 
copy of the record, so the information is always consistent. 


A master/slave configuration can be susceptible to message-processing 
bottlenecks in the master, particularly at higher degrees of paral- 
lelism [15]. Our systems did not exhibit significant communication over- 
head, implying that such a bottleneck is not a factor up to parallelism of 
degree 7 [1]. 

Another issue that arises from employing a master process is the se- 
lection of a physical processor on which to place it. In our system the 
master, requiring file I/O capability, had to reside on sunshine for all 
experiments. Each slave process was assigned to its own standalone pro- 
cessor in our 2 and 4 processor systems. However, because of limited 
hardware availability when our experimental work began, it was initially 
convenient to allow sunshine to host both the master and a slave for the 7 
processor experiments. Later the effects of a doubled versus an indepen- 
dent master were explored, using a seventh (newly acquired) standalone 
processor; the results are presented in Section 6.2. 


5.3 Message Passing Operations 


Three message passing operations occur in our systems: polls, sends, 
and receives. Sends and receives copy information into and out of system 
buffers, respectively, taking small and constant amounts of time. Polls 
are of two sorts, non-blocking and blocking. Non-blocking polls are used 
by slaves to check for new bounds before node expansion, and so take 
small and bounded time. 

Immediately after a slave sends results to the master, it issues a blocking 
poll so that it can wait for new work. The interval spanned by this poll 
incorporates the time for the results to reach the master, the time for the 
master to process them and build a new piece of work, and the time for 
the new work to reach the slave’s system buffer. Thus the polling interval 
embodies all time spent in communication between processes during the 
search, as well as time spent waiting for work. 


6 Identifying Overheads 


Our averaged speedup figures show the declining effectiveness of addi- 
tional processors as the degree of parallelism is increased. For the graphs 
in our test set the speedups were as follows: 1.81 for 2 processors, 3.26 
for 4 processors, and 4.69 for 7 processors. In this section we identify 
and quantify the overheads that limited the performance of our parallel 
systems. 
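The declining effectiveness can be made explicit by converting the averaged speedups quoted above into parallel efficiency (speedup divided by the number of processors). The snippet below is a small illustration; the helper name `efficiency` is ours.

```python
# Averaged speedups reported in the text, keyed by number of processors.
speedups = {2: 1.81, 4: 3.26, 7: 4.69}


def efficiency(speedup: float, procs: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / procs


if __name__ == "__main__":
    for procs, s in speedups.items():
        print(f"{procs} processors: efficiency {efficiency(s, procs):.3f}")
```

Efficiency falls from roughly 0.9 with 2 processors to about 0.67 with 7, so about a third of the available processing power is lost at the highest degree of parallelism tested.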

We note that a positive correlation between problem size and speedup 
was discovered, which indicates that experiments with larger graphs 
might yield different figures. We have addressed the generation of large
graphs having specified statistical properties, and have explored a variety
of binary-tree processor configurations for solving vertex cover [2]. 


6.1 Estimating Communication Overhead


Data that measures idle time per processor suffices to quantify communication
overhead. In the 4 processor case slaves ran as the only process
on their respective machines, while in the 7 processor case the master
process and one slave were doubled up. Our prediction that the doubled-up
processor would be capable of less work as a slave is borne out by
lower-than-average node counts (counts of nodes searched) [1].

The salient feature in the 4 processor data agrees well with intuitive
prediction. For each problem at least one processor has effectively 0 idle
time relative to solution time. (Solution times for 4 processors appear
in the Elapsed column of Table 2. Idle times were between 30 and 40
ms, with one exception of 118 ms.) Since the calculation of idle time
includes time spent on communication, the communication loss for that
processor is also effectively zero. If we assume that total communication
overhead is equally distributed across processors, it follows that communication
overhead is negligible for all processors. This assumption
is intuitively rational, there being no apparent reason to suspect that the
“busy” processor is biased toward a lesser degree of communication.

The 7 processor data exhibits an interesting anomaly in the elapsed
times of the last processor to finish. When the doubled-up processor
is last to finish, it usually has an idle time of exactly 0. Each of the
other processors, when last to finish, has an idle time in the range 76
to 196 ms, a roughly constant discrepancy from the range of minimum
times found in the 4 processor data. This noise may be an effect of the
doubling-up of master and slave, a factor for which control was lacking.
We subsequently investigated the effects of doubling-up, and present a
discussion of the results in Section 6.2.

Two other aspects of the 7 processor data bear mention. First, the noise
in the idle-time data reinforces the need to experiment with sufficiently
large problem instances, so that solution times overwhelm small and constant
overheads that are difficult to identify. Second, there are examples
where several processors have long idle times (on the order of 30% of
the total running time), highlighting the inefficiency of dfl scheduling.


6.2. Placement of Master Process 


In Table 1 we present data on the effect of doubling a master and a slave 
on one machine. For this experiment five of the computationally more 
expensive problems were chosen, and a seventh standalone processor 
was used for the 7 processor independent master tests. Solution times 
with an independent master appear in the Master Alone columns, and the 
increases in solution time when the master was doubled up appear in the 
Master Doubled columns. 


                         Times (seconds)

         2 Processors       4 Processors       7 Processors
Prob.    Master  Master     Master  Master     Master  Master
  #      Alone   Doubled    Alone   Doubled    Alone   Doubled
 11
 12
 13
 14
 15

Table 1: Timing Effects of Doubling Master and Slave 


The data shows that for 2 and 4 processor systems an independent
master increases performance by a non-trivial amount. However, with 
7 processors an independent master led to some slower solution times 
(Problems 11 and 12). For these instances the increase in processing 
power was small enough to be offset by random, detrimental changes 
in the order of work allocation. Our conclusion is that, as the degree 
of parallelism increases, the benefit from an independent master tends to
zero.

We infer that having one slightly faster processor will have just as little
effect with 7 or more processors as having a slightly slower one. This 
justifies a direct comparison [1] between our 7 processor system and the 
one used in the original study. On the other hand, in the 4 processor case 
our solution times may overestimate the power of our system, since the 
master resided on an independent fifth processor. The general conclusion 
drawn from the data is that the greater the degree of parallelism, the less 
significant the effect of having a processor of different capability. 


6.3. Estimating Synchronization Overhead 


Table 2 compares solution time Overrun with estimates of synchro- 
nization loss for the 4 processor case. The estimates, which appear in 
the Clocking and Node Counts columns, are derived from two different 
measurement methods. Solution times appear in the Elapsed column. 

Overrun is simply Elapsed − Linear Speedup, where Linear Speedup is
the sequential solution time divided by the number of processors. Thus
overrun quantifies the difference between the effective and the ideal power
of a parallel solution.

The Clocking method of estimating synchronization loss is just the 
sum of clocked polling intervals over all slaves, divided by the number 
of slaves. The results appear in the Average Idle column under Clocking. 

The Node Counts method is slightly more complex. A cost per node 
value is derived for a given search tree, by dividing the largest of the 
processor node counts into the solution time. The working time for each 
processor is computed as cost per node × node count, where node count
is for that processor. The idle time for each processor is then Elapsed — 
working time. The average of these times appears in the Average Idle 
column under Node Counts. 
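The three estimators can be written down compactly. The sketch below is an illustration under stated assumptions rather than the authors' measurement code: all function names are hypothetical, and the formula for the Error (%) columns (the gap between Overrun and the idle estimate, as a percentage of Overrun) is inferred from the statement in this section that discrepancies are presented as percentages of Overrun.

```python
def overrun(elapsed, sequential_time, procs):
    """Elapsed time minus the linear-speedup time (sequential time / processors)."""
    return elapsed - sequential_time / procs


def clocking_idle(polling_totals):
    """Clocking estimate: mean over slaves of each slave's summed polling intervals."""
    return sum(polling_totals) / len(polling_totals)


def node_count_idle(elapsed, node_counts):
    """Node Counts estimate of average idle time per processor."""
    cost_per_node = elapsed / max(node_counts)  # busiest processor is assumed never idle
    idle = [elapsed - cost_per_node * count for count in node_counts]
    return sum(idle) / len(idle)


def error_percent(overrun_time, idle_estimate):
    """Assumed definition: discrepancy expressed as a percentage of Overrun."""
    return 100.0 * (overrun_time - idle_estimate) / overrun_time


if __name__ == "__main__":
    # Problem 2 of Table 2: Overrun 3.53 s, Clocking average idle 1.83 s.
    print(round(error_percent(3.53, 1.83), 1))
    # Hypothetical node counts for a 4 processor run taking 10 seconds.
    print(node_count_idle(10.0, [1000, 800, 600, 400]))
```

Note that the Node Counts method assigns zero idle time to the processor with the largest node count by construction, so it can only account for idleness relative to the busiest processor.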


             Times (seconds)            Synch. Loss Estimators

                                        Clocking         Node Counts
 Pr.            Linear                  Av.     Error    Av.     Error
  #   Elapsed   Speedup    Overrun      Idle    (%)      Idle    (%)
  1     5.79 2.69 22 1
  2     9.35      5.82       3.53       1.83    48.2
  3    24.63     13.19      11.44       8.05    29.6
  4    37.56     33.44       4.12       4.57   -10.7
  5    10.74      8.45       2.29       1.33    41.9
  6    11.26      8.98       2.28       1.64    28.1
  7    22.02     17.14       4.88       3.87    20.9
  8    22.86     19.34       3.52       3.56    -1.1
  9    90.75     64.63      26.13      17.43    33.3
 10    29.78     21.59       8.19       5.90    28.0
 11   158.59    135.19      23.40      17.47    25.3
 12   109.39     81.75      27.64      18.66    32.5
 13   127.61    114.20      13.41      10.50    21.7
 14   163.50    142.40      21.10      13.52    36.0
 15   106.74 22.20 49.8

Mean   63.69     51.91      11.78       8.12    31.1     7.86    33.3


Table 2: Synchronization Overhead Estimates for 4 Processors 


We interpret the data in Table 2 with the following reasoning: 


1. Communication overhead is negligible, as found in Section 6.1. 
Since Clocking embodies only communication and synchronization 
losses, it accurately measures the latter. Now, comparing Average 
Idle columns in Table 2, the two measurement methods correlate 
well, especially for larger problems. Thus the Node Counts method 
is also accurate as a measure of synchronization loss. 


2. Cost per node has the same expected value for all regions of the 
search tree, given the correctness of Node Counts. If this were 
not so, then some regions would be more expensive to search than 
others, and Node Counts would not generate average idle times that 
consistently agree with Clocking. Moreover, clocked working times 
per processor correlate well with node counts per processor [1], 
showing that cost per node is uniform over subtrees. 


3. Poor speedups cannot be blamed on a higher distribution of expen- 
sive nodes in a parallel search tree, given a uniform cost per node. 
Therefore, sub-linear speedup, and hence overrun, are directly linked 
to lost potential in the parallel solution. More specifically, overrun 
is linked to synchronization loss, since communication and search 
overheads are negligible [1]. 


Our reasoning leads to the conclusion that each of Clocking, Node
Counts, and Overrun accurately measures synchronization loss. Unfortunately,
although the two former methods correlate well, there is a non-
trivial discrepancy when they are compared to the latter. The discrepan- 
cies are presented as percentages of Overrun in Table 2. 

This inconsistency between arguably valid results is a matter for fur- 
ther investigation, pointing to the more general difficulty of determining 
the exact cause of reduced effective power in parallel solutions. Over- 
heads other than communication, search, and synchronization might be 
involved. It is also possible that a systematic variation in cost per node, 
not manifested in our data, causes parallel search tree nodes to be more 
expensive than sequential ones on average. 

One hypothesis was that losses due to additional startup costs in a 
parallel solution must be factored into the analysis. However, such over- 
heads are invariant over problems, and the discrepancy between methods 
is proportional to Overrun, so the category of constant overheads was 
ruled out as a possible explanation. 


6.4 Variance in Solution Times 


In order to be certain that there was no significant variance in solution 
times under different Ethernet traffic, we solved one particular problem 
ten successive times [1]. There is a slight variation in node counts be- 
tween runs, confirming the evidence of the 7 processor data in Table 1, 
namely that parallel search can be affected by small changes in the tim- 
ing of inter-processor communication. However, the differences are not 
significant, and we conclude that variance in solution times was not a 
factor in our experiments. 


7 Summary and Conclusions 


Our goal in this study was to identify and quantify overheads that re- 
duce the effective power of loosely coupled parallel systems. We chose 
vertex cover as a problem simpler than chess, but whose parallel solu- 
tions suffer comparable synchronization losses [8]. Given the analogy, 
we expect that insights that help to characterize overheads will transfer 
between the two problems. 

Through replicating a previous study [15], we were able to compare 
two hardware and software systems, and assess how their differences 
influenced solution speed and choice of processor configuration. The 
greater speed of our systems identified a need to experiment with bigger 
problem instances, so that data is not affected by small and random vari- 
ances introduced through parallelism. Using a single-level process tree 
led us to assess where best to place the master process, and to the conclu- 
sion that adding a non-standard processor to an otherwise homogeneous 
set has decreasing effect as the degree of parallelism increases. 

In implementing the original search algorithm, we devised a faster 
algorithm that searches highly skewed trees [1]. The existence of this 
faster algorithm makes it clear that the effective power of a parallel 
solution can only be assessed relative to the fastest sequential solution 
available. Design of processor configurations to effectively search highly 
skewed trees in parallel has been undertaken [2]. 

Our experiments with the original search and processor scheduling al- 
gorithms showed negligible effects from search and communication over- 
head, isolating synchronization overhead as the detrimental factor. Esti- 
mates were computed using three empirical methods, only two of which 
converged to similar numbers. The discrepancy between the Overrun 
method and the Clocking and Node Count methods is a matter for fur- 
ther research, and highlights the difficulty of accurately identifying and 
quantifying overheads. 

Once the source of losses in loosely coupled systems is clearly
identified, analysis can be extended to processor scheduling methods
more complex than dfl. One such method currently under study is the
dynamic re-allocation of processors when they become idle, either on a
buddy basis or by stopping a given subtree search and re-starting with 
more processors [3]. Another approach employs speculative computing, 
in which newly idle processors continue with the next phase of their 
work, after assuming that the outcome of searches underway on other 
processors will not affect their work choice [10].


Abstract


This paper presents a parallel algorithm to sort $n$ data items on an
architecture consisting of $p$ processor-memory-switch modules
interconnected in the topology of a binary cube. The switches form a
pipelined, packet-switched interconnection network that is used to route
data between the processors. The time complexity, including both
communication and computation costs, is $O(n \log n / p + \log^2 p)$,
yielding linear speedup for all $p$, $1 \le p \le n / \log n$.


1. Introduction 


Sorting is a fundamental computational problem that has been
investigated for several decades. Several parallel sorting algorithms
that achieve polylogarithmic execution times and linear speedups on the
idealized parallel random access machine model (e.g.,
[3,5,7,8,11,16,17,21]) have been discovered. Efficient sorting on network
and VLSI computational models is described in [1,4,6,13,14,18,20]. A
coarse-grained parallel sorting algorithm for a (non-pipelined)
hypercube [9] achieves linear speedup only for $1 \le p \le \log n$. The
reader is referred to [2,10] for several sorting methods, and for
further references.


In this paper we present a coarse-grained parallel sorting algorithm 
that can be mapped onto a pipelined hypercube architecture of p PEs. 
The latter is one of a class of parallel architectures referred to as ensemble 
architectures [19]. These represent a cost-effective means of implement- 
ing parallel systems and are rapidly becoming available commercially. 
Our sorting algorithm requires $O(n \log n / p + \log^2 p)$ time,
including both computation and communication costs, and thereby achieves
linear speedup for all $p$, $1 \le p \le n / \log n$.


The rest of the paper is organized as follows. In Section 2, the pipe- 
lined hypercube architecture is described. Section 2.1 introduces com- 
munication graphs (CGs) and formalizes their relation to routing on the 
pipelined hypercube. In Section 2.2 we show that communication traffic
patterns arising in the sorting algorithm can be mapped onto the hyper- 
cube, to achieve conflict-free routing. Section 3 describes the sorting 
algorithm and its implementation on the pipelined hypercube. 


2. Architectural Model 


A pipelined hypercube network consists of $p = 2^k$ nodes, $k \ge 0$,
indexed from $0$ to $2^k - 1$, and connected in the topology of a binary
cube of dimension $k$; i.e., nodes $i$ and $j$ are connected whenever the
binary representations of $i$ and $j$ differ in exactly one bit.
Figure 1(a) shows a binary cube for $p = 8$.


In the following $i_{k-1} i_{k-2} \ldots i_1 i_0$ denotes the binary
representation of integer $i$, $0 \le i \le 2^k - 1$, and $i^{(l)}$
denotes the integer whose binary representation differs from that of $i$
in the bit numbered $l$. Each node in the binary cube is a
processor-memory-switch (PMS) module. PMS[i] consists of a
processor-memory module (denoted by PE[i]), and $k$ switch boxes,
$(0, i), (1, i), \ldots, (k-1, i)$, one for each dimension. PE[i] is
connected by a shared bus to the switches in its module. Each switch
$(l, i)$, $0 \le l \le k-2$, is connected by a bidirectional, full-duplex
intra-module link to switch $(l+1, i)$. Inter-module links connect
switches in different PMS modules. Formally, for each pair of nodes, $i$
and $i^{(l)}$, that are connected in the binary cube, there is a
bidirectional, full-duplex link between switch $(l, i)$ and switch
$(l, i^{(l)})$. Figure 1(b) shows a PMS module for a three-dimensional
cube.


The switches form a synchronous, pipelined packet-switched net- 
work that is used to transfer blocks of data between the PMS modules. 
Three types of communication traffic that arise in the sorting algorithm 
must be supported by the network. These will be referred to as forward 
routing, reverse routing and cube routing respectively. We now describe 
the functional requirements of the switches for each type of routing.

A cycle of a switch consists of an odd phase and an even phase. The odd
phase consists of data transfer between switches in the same PMS module
along intra-module links; in the even phase data is transferred between
different modules using inter-module links. In forward routing,
communication during the odd phase is from switch $(l, i)$ to
$(l+1, i)$, while for reverse routing it is from $(l, i)$ to
$(l-1, i)$. On receiving a packet, switch $(l, i)$ decodes the
destination address associated with the packet, and buffers it for
transmission on either the intra-module link or the inter-module link to
$(l, i^{(l)})$ as appropriate. If the packet is buffered for
transmission on the intra-module link, the packet will be transferred to
the switch $(l+1, i)$ (or $(l-1, i)$) in the odd phase of the next
cycle. Otherwise, it will be transferred to $(l, i^{(l)})$ in the even
phase of the current cycle.


Cube routing is employed to emulate the point-to-point connections 
of a binary cube. We require at most one switch to send (or receive) a 
packet to (from) the PE in its PMS module in the same cycle. Thus a 
shared bus between a PE and the switches in its module represents an ade- 
quate connection. 


In the next section, a formal graph-theoretic model of the intercon- 
nection network is presented, and communication traffic patterns that 
arise in the sorting algorithm are proved to be conflict free. 


2.1. Communication Graphs 


A CG (communication graph) is a directed graph whose nodes 
represent switches and whose edges represent unidirectional communica- 
tion links between the switches. Nodes with no incoming (outgoing) 
edges will be called sources (sinks). We define two CGs, F and R, on 
which the required traffic patterns are proved to be conflict free. We then 
show that a conflict-free set of routes in either F or R corresponds to 
conflict free routing on the pipelined hypercube. 


Both F and R have $p(k+1)$ nodes arranged in $k+1$ levels, with
$p = 2^k$ nodes at each level. A node is denoted by $(l, i)$, where $l$
is the level number, $0 \le l \le k$, and $i$ is the index of the node
within the level, $0 \le i \le p-1$. In F, a node $(l, i)$ at level $l$,
$0 \le l \le k-1$, is connected to the two nodes $(l+1, i)$ and
$(l+1, i^{(l)})$, by edges directed from the former into the latter;
here $i^{(l)}$ is the index that differs from $i$ in bit $l$ only. In R,
a node $(l, i)$, $0 \le l \le k-1$, is connected to the two nodes
$(l+1, i)$ and $(l+1, i^{(k-l-1)})$. F will be referred to as the forward
network and R will be referred to as the reverse network. (See figure 2.)


A switch at level $l$, $0 \le l < k$, examines a bit of the address
associated with a packet, and passes it at the next cycle to a switch at
level $l+1$. We describe two routing operations that the switches
support, namely least significant bit (LSB) routing and most significant
bit (MSB) routing. The switches in F employ LSB routing while those in R
employ MSB routing.


In LSB routing, node $(l, i)$ of F, $0 \le l \le k-1$, routes a packet to
either node $(l+1, i)$ or to node $(l+1, i^{(l)})$ depending on whether
the $l$th bit of the address field $A_F$ matches the $l$th bit of $i$ or
not, respectively. In MSB routing, node $(l, i)$ of R,
$0 \le l \le k-1$, routes a packet to either node $(l+1, i)$ or to node
$(l+1, i^{(k-l-1)})$ depending on whether the $(k-l-1)$th bit of the
address field $A_R$ matches the same bit of $i$ or not, respectively.
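
The two routing rules can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `forward_route` and `reverse_route` are assumptions introduced here, node indices are plain integers, and bit 0 is taken to be the least significant bit.

```python
def forward_route(i, j, k):
    """Nodes visited in F by a packet sent from (0, i) with address A_F = j."""
    route, u = [(0, i)], i
    for l in range(k):
        # LSB routing: level l compares bit l of the address with bit l of u.
        if (u >> l) & 1 != (j >> l) & 1:
            u ^= 1 << l            # take the edge to (l+1, u with bit l flipped)
        route.append((l + 1, u))
    return route


def reverse_route(i, j, k):
    """Nodes visited in R by a packet sent from (0, i) with address A_R = j."""
    route, u = [(0, i)], i
    for l in range(k):
        b = k - l - 1              # MSB routing: level l examines bit k-l-1
        if (u >> b) & 1 != (j >> b) & 1:
            u ^= 1 << b
        route.append((l + 1, u))
    return route
```

For a three-dimensional cube, `forward_route(5, 2, 3)` fixes the address bits from the low end upward, while `reverse_route(5, 2, 3)` fixes them from the high end downward; both end at node `(3, 2)`.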


The switches in R also support a variant of MSB routing referred to
as MSB routing with copy. This is used to implement a broadcast facility,
in which a data packet can be sent simultaneously from a node $(0, i)$ to
all the consecutively indexed destination nodes $(k, START_i)$,
$(k, START_i + 1)$, ..., $(k, END_i - 1)$, $(k, END_i)$. The address field
$A_R$ now consists of the pair of integers $(START_i, END_i)$,
$START_i \le END_i$, which define the limits within which the packet must
be sent. Each switch node $(l, j)$, $0 \le l \le k-1$, performs the
following actions on receiving a packet of this form. If the $(k-l-1)$th
bits of $START_i$ and $END_i$ are the same, the node implements the usual
MSB routing to route the packet to the node indicated by the address
$START_i$. If the two bits are different, then the packet is forwarded to
both the nodes $(l+1, j)$ and $(l+1, j^{(k-l-1)})$. However, the addresses
$START_i$ and $END_i$ that are forwarded to the two nodes are updated as
follows. The copy forwarded to the node with the smaller index will have
$END_i$ set to $2^k - 1$ and that forwarded to the node with the larger
index will have $START_i$ set to zero.
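
The copy rule can be traced with a small simulation. The sketch below is an illustration under stated assumptions, not the switch implementation: `broadcast_destinations` is a hypothetical name, and each packet is modelled as a tuple of the node index it currently occupies and the `(START, END)` pair it carries.

```python
def broadcast_destinations(i, start, end, k):
    """Level-k indices reached from node (0, i) by MSB routing with copy."""
    full = (1 << k) - 1
    frontier = [(i, start, end)]           # (node index, START, END) per packet copy
    for l in range(k):
        bit = 1 << (k - l - 1)             # level l examines bit k-l-1
        nxt = []
        for j, s, e in frontier:
            if (s & bit) == (e & bit):
                # Bits agree: usual MSB routing toward START.
                if (j & bit) != (s & bit):
                    j ^= bit
                nxt.append((j, s, e))
            else:
                # Bits differ: forward to both successors with updated limits.
                nxt.append((j & ~bit, s, full))   # smaller index: END := 2^k - 1
                nxt.append((j | bit, 0, e))       # larger index:  START := 0
        frontier = nxt
    return sorted(j for j, _, _ in frontier)
```

Calling `broadcast_destinations(5, 2, 6, 3)` delivers one copy to each of the level-3 nodes 2 through 6; only the bits below the one currently examined matter after an update, which is why overwriting a whole limit with all ones or all zeros is harmless.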
Definitions : The route in F (R) from node $(0, i)$ to node $(k, j)$ is
the ordered sequence of nodes in F (respectively R),
$((0, i), (1, v_1), \ldots, (l, v_l), \ldots, (k, j))$, that a packet
with address $A_F = j$ (respectively $A_R = j$) passes through. The
sequence of edges between nodes in the route is the path of the
route. A route in F is referred to as a forward route, while a route
in R is referred to as a reverse route. Two routes are said to be
conflict free if they are node disjoint. A set of routes is conflict
free if they are pairwise node disjoint.


We now relate F and R to the pipelined hypercube of Section 2, and
show how conflict-free routes in F (or R) imply link-disjoint routes in
the pipelined hypercube.


In the following let $H_F$ ($H_R$) refer to the graph obtained from F
(respectively R) by replacing the directed edge
$((l, u), (l+1, u^{(l)}))$ (respectively $((l, u), (l+1, u^{(k-l-1)}))$)
with the directed edge $((l+1, u), (l+1, u^{(l)}))$ (respectively
$((l+1, u), (l+1, u^{(k-l-1)}))$). Figure 3 shows $H_F$ and $H_R$
corresponding to F and R of figure 2. To demonstrate the similarity of
the switches and links of the pipelined hypercube network to the nodes
and edges of $H_F$ and $H_R$, the nodes in figure 2 have been renumbered
as follows. Node $(l, u)$ in $H_F$ is renamed $(l-1, u)$, and node
$(l, u)$ in $H_R$ is renamed $(k-l, u)$, $1 \le l \le k$. Node $(0, u)$
is shown as PE[u]. It may be seen that with this renaming of nodes,
$H_F$ maps directly onto the switches and links of the hypercube network
used for forward routing, and $H_R$ to the switches and links used for
reverse routing.

Definition : A route in $H_F$ is obtained from a route in F by replacing
every edge $((l, u), (l+1, u^{(l)}))$ in the latter path by the two
directed edges $((l, u), (l+1, u))$ and $((l+1, u), (l+1, u^{(l)}))$,
$0 \le l \le k-1$. Similarly, a route in $H_R$ is obtained from a
route in R by replacing every edge $((l, u), (l+1, u^{(k-l-1)}))$ by
the two directed edges $((l, u), (l+1, u))$ and
$((l+1, u), (l+1, u^{(k-l-1)}))$, $0 \le l \le k-1$.

Theorem 2.1 : Let $P_1$ and $P_2$ be the paths of two node disjoint
routes in F (or R) and $P_1'$ and $P_2'$ be the corresponding paths in
$H_F$ (respectively $H_R$). Then $P_1'$ and $P_2'$ are edge disjoint.

Proof: By contradiction. If an edge $e$ exists that is common to $P_1'$
and $P_2'$ then,
if $e = ((l, u), (l+1, u))$, the node $(l, u)$ is common to both $P_1$
and $P_2$;
else $e = ((l, u), (l, \bar{u}))$, where $\bar{u} = u^{(l-1)}$ or
$u^{(k-l)}$ according to whether $P_1$ and $P_2$ are from F or R, and
it follows that the node $(l-1, u)$ is common to the two paths.


The theorem shows that if the required set of routings can be shown 
to be conflict-free in F (or R), then the routes using forward (respectively 
reverse) routing on the pipelined hypercube will be link disjoint. 


2.2. Conflict Free Routing 


In a series of lemmas we describe several routing patterns that arise
in the sorting algorithm, and show that they are conflict-free in F or R.
Closely related results for performing certain of these routings on a
binary cube have been previously reported in [15]. As a consequence of
Theorem 2.1, these routings can be performed without link conflict in the
pipelined hypercube using forward or reverse routing respectively.

Lemma 2.1 : Let $(l, u)$ be a node on the route from $(0, i)$ to
$(k, j)$ in F (R). Then the binary representation of $u$ is
$i_{k-1} i_{k-2} \ldots i_l \, j_{l-1} j_{l-2} \ldots j_0$
(respectively, $j_{k-1} j_{k-2} \ldots j_{k-l} \, i_{k-l-1} i_{k-l-2} \ldots i_0$).

Proof: Direct consequence of LSB (MSB) routing.

Lemma 2.2 : Let $((0, s(i)), (k, t(i)))$, $0 \le i \le r-1$, be a
collection of $r$ pairs such that
$0 \le s(0) < s(1) < \ldots < s(r-1) \le 2^k - 1$,
$0 \le t(0) < t(1) < \ldots < t(r-1) \le 2^k - 1$, and
$s(i+1) - s(i) \ge t(i+1) - t(i)$, for all $i$, $0 \le i \le r-2$.
Then the set of routes in F from each node $(0, s(i))$ to the node
$(k, t(i))$ is conflict-free.

Proof: Consider any pair $i$, $j$ and without loss of generality assume
that $i > j$. Let $u = s(i)$, $v = s(j)$, $x = t(i)$ and $y = t(j)$.

Assume, contrary to the lemma, that $(l, w)$ is a node that is
common to the two routes. From lemma 2.1 above,
$w = u_{k-1} u_{k-2} \ldots u_l \, x_{l-1} x_{l-2} \ldots x_0
   = v_{k-1} v_{k-2} \ldots v_l \, y_{l-1} y_{l-2} \ldots y_0$.
Then $u - v < 2^l$ and $x - y \ge 2^l$, which contradicts the fact
that $u - v \ge x - y$. Since $i$ and $j$ were arbitrary, the set of
routes is conflict free.

A similar lemma holds for the routes in R: 

Lemma 2.3 : Let $((0, s(i)), (k, t(i)))$, $0 \le i \le r-1$, be a
collection of $r$ pairs such that
$0 \le s(0) < s(1) < \ldots < s(r-1) \le 2^k - 1$,
$0 \le t(0) < t(1) < \ldots < t(r-1) \le 2^k - 1$, and
$s(i+1) - s(i) \le t(i+1) - t(i)$, for all $i$, $0 \le i \le r-2$.
Then the set of routes in R from each node $(0, s(i))$ to the node
$(k, t(i))$ is conflict-free.
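
For small cubes the two lemmas can be checked by brute force. The sketch below is a sanity check rather than a proof; the helper names `route`, `conflict_free` and `check_lemmas` are assumptions introduced for illustration, and the gap conditions are taken as "source gaps at least as large as destination gaps" for F and the opposite for R.

```python
from itertools import combinations


def route(i, j, k, msb=False):
    """Nodes on the route from (0, i) to (k, j) in F (msb=False) or R (msb=True)."""
    u, nodes = i, [(0, i)]
    for l in range(k):
        b = k - l - 1 if msb else l
        u = (u & ~(1 << b)) | (j & (1 << b))   # copy bit b of the address into u
        nodes.append((l + 1, u))
    return nodes


def conflict_free(s, t, k, msb=False):
    """True if the routes s[x] -> t[x] are pairwise node disjoint."""
    seen = set()
    for src, dst in zip(s, t):
        nodes = route(src, dst, k, msb)
        if seen.intersection(nodes):
            return False
        seen.update(nodes)
    return True


def check_lemmas(k):
    """Exhaustively test Lemma 2.2 (on F) and Lemma 2.3 (on R) for dimension k."""
    idx = range(1 << k)
    for r in range(1, (1 << k) + 1):
        for s in combinations(idx, r):
            for t in combinations(idx, r):
                gs = [b - a for a, b in zip(s, s[1:])]
                gt = [b - a for a, b in zip(t, t[1:])]
                if all(x >= y for x, y in zip(gs, gt)) and not conflict_free(s, t, k):
                    return False
                if all(x <= y for x, y in zip(gs, gt)) and not conflict_free(s, t, k, msb=True):
                    return False
    return True
```

When the gap condition is violated, conflicts can occur: in F with `k = 3`, the routes 0 to 0 and 1 to 2 both pass through node (1, 0), so `conflict_free([0, 1], [0, 2], 3)` is false.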

Definition : Broadcast routing is defined as follows. Let
$\{(0, i) \mid 0 \le i \le r-1\}$ be a set of sources in R. Associated
with each source $(0, i)$ is a pair of integers, $START_i$ and $END_i$,
such that $END_i \ge START_i$. For all $i$, $0 \le i \le r-2$,
$START_{i+1} = END_i + 1$, and $END_{r-1} = p - 1$. Route data from
$(0, i)$ to all $(k, u)$, $START_i \le u \le END_i$.


Broadcast routing is performed using MSB with copy in R. The
following lemma can be shown to hold (see [22]) for broadcast routing:

Lemma 2.4 : Let $(0, i)$ and $(0, j)$ be two sources involved in a
broadcast operation. Then the routes from $(0, i)$ to $(k, u)$ and from
$(0, j)$ to $(k, v)$, for any pair $u$, $v$, $START_i \le u \le END_i$
and $START_j \le v \le END_j$, are conflict free.


Corollary : Broadcast routing is conflict free. 


Combinations of special cases of the routings implied in the above
lemmas arise in the sorting algorithm. The special case of lemma 2.2
when $t(i) = i$ is known as Concentrate. A Concentrate followed by a
Broadcast is known as Scatter. The special case of lemma 2.3 when
$s(i) \le t(i)$, $0 \le i \le r-1$, is known as Right Transfer routing;
the case when $s(i) > t(i)$ for all $i$ is correspondingly referred to as
Left Transfer. A Left Transfer followed by a Right Transfer is referred
to as a Weave. All of these routings can be performed using link
disjoint routes on the pipelined hypercube.


3. Main Algorithm 


The sorting procedure is a parallelization of the standard merge-sort 
as shown below in the procedure Parallel Merge Sort. The crux of the 
procedure is the parallel algorithm for merging two ordered lists, as 
described in section 3.1. Section 3.2 contains the complexity analysis. 

Parallel Merge Sort (W[0 .. N-1], M)
/* Sort array W using M = 2^K PEs. Let m = N/M */
/* Let W[im .. (i+1)m - 1] be denoted by W_i */

1. PE[i], 0 <= i <= M - 1, is loaded with W_i.
2. Each PE[i] independently sorts W_i into increasing order.
3. For j = 0 to K - 1 do
      For each s, 0 <= s <= 2^(K-1-j) - 1, do independently and
      concurrently :
         /* Let i = s 2^(j+1) */
         /* Let A_s = (W_i ... W_(i+2^j-1)) and
                B_s = (W_(i+2^j) ... W_(i+2^(j+1)-1)) */
         Merge A_s and B_s using procedure Coarse Merge(A_s, B_s, 2^(j+1)).
   End.

End Parallel Merge Sort.
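
The block structure of the procedure can be simulated sequentially. In the sketch below, which is illustrative only, `parallel_merge_sort` is a hypothetical name, each list in `blocks` stands for the data held by one PE, and `heapq.merge` is used as a stand-in for Coarse Merge (it is not the paper's merging procedure). It assumes that M is a power of two and divides the input length.

```python
from heapq import merge


def parallel_merge_sort(W, M):
    """Sequential simulation of Parallel Merge Sort on M = 2**K simulated PEs."""
    m = len(W) // M                                   # items per PE
    # Steps 1 and 2: load block W_i into PE[i] and sort it locally.
    blocks = [sorted(W[i * m:(i + 1) * m]) for i in range(M)]
    K = M.bit_length() - 1
    for j in range(K):                                # step 3
        width = 1 << j                                # PEs holding A_s (and B_s)
        for s in range(1 << (K - 1 - j)):             # independent merges
            i = s << (j + 1)
            A = [x for blk in blocks[i:i + width] for x in blk]
            B = [x for blk in blocks[i + width:i + 2 * width] for x in blk]
            C = list(merge(A, B))                     # stand-in for Coarse Merge
            for q in range(2 * width):                # PE[i+q] keeps m items
                blocks[i + q] = C[q * m:(q + 1) * m]
    return blocks
```

After the last iteration the concatenation of the blocks is the sorted input, with exactly m items per simulated PE.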


3.1. The Merging Procedure CoarseMerge 


Let A[0 ... n-1] and B[0 ... n-1] be the two sorted arrays of
elements to be merged, where the elements are drawn from a totally
ordered set. The arrays will be merged using $2p$, $p = 2^{k-1}$, PEs.
Let $m = n/p$. The elements A[im ... (i+1)m-1] are stored in PE[i] and
elements B[im ... (i+1)m-1] in PE[p+i], $0 \le i \le p-1$, and are
referred to as blocks $A_i$ and $B_i$ respectively. The output of the
merge is the sorted array C[0 ... 2n-1], such that PE[i] contains
elements C[im ... (i+1)m-1].


The merging algorithm consists of three phases. The first phase 
consists of a fine-grained merging procedure that is performed on a list of 
representative elements, one from each block. The outcome of the fine- 
grained merge is used to pair each block of A with some block of B and 
each block of B with some block of A. If block $A_u$ ($B_u$) is paired
with block $B_v$ ($A_v$), then in the second phase, $B_v$ ($A_v$) is
transferred to the PE containing $A_u$ ($B_u$). After this data transfer
step, a PE will have two blocks $A_u$ and $B_v$. The PE merges the two
blocks into a single sorted list,
removing duplicates in the process. In the final phase, the individually 
merged blocks are routed to the appropriate processors, so that the output 
is ordered according to the chosen convention. 


In the first phase, elements A[im] (and B[im]), the smallest elements
in each of the blocks and denoted by $a_i$ (respectively $b_i$), are
merged into a single sorted sequence
$\Gamma = t_0, t_1, \ldots, t_{2p-1}$.


Definitions : An element $t_i$, $0 \le i \le 2p-1$, in $\Gamma$ is a
border element if either $t_i = a_v$ and $t_{i+1} = b_w$, or
$t_i = b_v$ and $t_{i+1} = a_w$ ($0 \le v, w \le p-1$). It is
convenient to define non-existent elements $t_{-1}$ and $t_{2p}$ as
border elements. A subsequence of $\Gamma$ consisting of elements
$t_i, t_{i+1}, \ldots, t_{i+r}$ is a run if none of
$t_i, t_{i+1}, \ldots, t_{i+r-1}$ are border elements. If $i = 0$, then
the run is called the first run. For each $i$, $0 \le i \le 2p-1$,
partner$(i) = k$ if $t_k$ is a border element, and
$t_{k+1}, t_{k+2}, \ldots, t_i$ is a run. If $t_i$ occurs in the first
run then partner$(i) = -1$. For each element $t_i$,
$0 \le i \le 2p-1$, if $t_i = a_j$, $0 \le j \le p-1$, then define
block $T_i$ to be $A_j$; if $t_i = b_j$ then define $T_i$ to be $B_j$.
It is convenient to consider the non-existent block $T_{-1}$ as a
block of $m$ elements, each of value $-\infty$ (the smallest possible
value), and the non-existent block $T_{2p}$ as a block of $m$ elements
each of value $+\infty$ (the highest possible value).
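
The border and partner definitions can be illustrated with a short sketch. The function name `borders_and_partners` and the encoding of $\Gamma$ as a list of origin labels are assumptions made here for illustration; the last element is simply treated as a non-border element, since its status does not affect any partner value.

```python
def borders_and_partners(kinds):
    """kinds[i] is 'A' or 'B', the array that representative t_i came from.

    Returns (border, partner): border[i] is True when t_i and t_(i+1) come
    from different arrays; partner[i] is the index of the nearest border
    element to the left of t_i, or -1 if t_i lies in the first run.
    """
    n = len(kinds)
    border = [i + 1 < n and kinds[i] != kinds[i + 1] for i in range(n)]
    partner, last_border = [], -1
    for i in range(n):
        partner.append(last_border)
        if border[i]:
            last_border = i
    return border, partner
```

For the merged order a0 a1 b0 b1 b2 a2 b3 a3, i.e. `kinds = list("AABBBABA")`, the blocks of b0, b1 and b2 all receive partner 1 (the block of a1), and a2, b3 and a3 receive partners 4, 5 and 6 respectively.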


For phase 2 of the algorithm, block $T_i$ is paired with
$T_{partner(i)}$. An example for the case $p = 4$ is shown in figure 4.
PE[0] through PE[3] contain the 4 blocks of A, $A_0$, $A_1$, $A_2$ and
$A_3$, and PE[4] through PE[7] contain the blocks $B_0$, $B_1$, $B_2$
and $B_3$ respectively. The pairings of blocks, as required by the
partner relation, are shown by arrows. In the example, $B_0$, $B_1$ and
$B_2$ are all paired with block $A_1$, $A_2$ is paired with $B_2$, $B_3$
is paired with $A_2$ and $A_3$ is paired with $B_3$. Each block is
sequentially merged with the block with which it is paired. When merging
block $T_i$ with $T_{partner(i)}$, the PE removes duplicates by
enforcing the following uniqueness conditions:

(i) discard all elements less than $t_i$; (ii) discard all elements
greater than or equal to $t_{i+1}$.

In the example, the merge of $B_0$ and $A_1$ retains only the elements
in the range $[b_0, b_1)$. Similarly the merge of $A_2$ with $B_2$
retains only the elements in the range $[a_2, b_3)$; and the merge of
$B_3$ with $A_2$ retains only the elements in the range $[b_3, a_3)$.


Let $U_i$, $0 \le i \le 2p-1$, be the block resulting from the merge of
$T_i$ and $T_{partner(i)}$ and the removal of duplicates. $|U_i|$
denotes the cardinality of $U_i$. Let
$U_i = [u_{i,0}, u_{i,1}, \ldots, u_{i,r}]$, with $|U_i| = r + 1$. Define
rank$(u_{i,j})$, $0 \le j \le r$, as the number of elements of A and B
that are less than $u_{i,j}$. Let $\alpha_i$ and $\beta_i$ be the number
of elements removed due to uniqueness conditions (i) and (ii)
respectively. (Define $\alpha_i = m$ for blocks paired with $T_{-1}$,
and define $\beta_{2p-1} = 0$.)

In the final step, we construct the block $C_i$ of the sorted sequence
by discarding the first $m - \alpha_i$ elements of $U_i$ and
concatenating to the end of the list that remains the first
$m - \alpha_{i+1}$ elements of $U_{i+1}$. The corollary to lemma 3.4 in
this section will establish that this can be done concurrently. (Proofs
of the following lemmas have been omitted due to space limitations and
are available in [22].)

Lemma 3.1 : For each $i$, $0 \le i \le 2p-1$,
$\beta_i + \alpha_{i+1} = m$.

Lemma 3.2 : For each $i$, $0 \le i \le 2p-1$,
$|U_i| = 2m - (\alpha_i + \beta_i)$.

Lemma 3.3 : For each $i$, $0 \le i \le 2p-1$,
rank$(u_{i,0}) = m(i-1) + \alpha_i$.

Lemma 3.4 : For each $i$, $0 \le i \le 2p-1$,
$|U_i| \ge m - \alpha_i$.

Corollary : $(m - \alpha_i)$ elements can be simultaneously transferred
from $U_i$ to $U_{i-1}$, $0 \le i \le 2p-1$.

Theorem 1 : The sequence of blocks $C_0 C_1 C_2 \ldots C_{2p-1}$ is
sorted, with $|C_i| = m$ for each $i$, $0 \le i \le 2p-1$.

Proof : $|C_i| = |U_i| + \alpha_i - \alpha_{i+1} = m$ from lemmas 3.1
and 3.2. That all elements of $C_{i-1}$ are in sorted order and
smaller than elements of $C_i$ is immediate from the fact that the
$U_i$ are in sorted order.
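
The cardinality claim in the proof can be expanded step by step. The following is a worked restatement that uses only the construction of $C_i$ (remove the first $m - \alpha_i$ elements of $U_i$, then append the first $m - \alpha_{i+1}$ elements of $U_{i+1}$) together with lemmas 3.1 and 3.2; it is not taken from [22].

```latex
\begin{align*}
|C_i| &= |U_i| - (m - \alpha_i) + (m - \alpha_{i+1}) \\
      &= |U_i| + \alpha_i - \alpha_{i+1} \\
      &= 2m - (\alpha_i + \beta_i) + \alpha_i - \alpha_{i+1}
         && \text{(lemma 3.2)} \\
      &= 2m - (\beta_i + \alpha_{i+1}) \\
      &= m && \text{(lemma 3.1)}
\end{align*}
```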

Following is the formal description of the procedure Coarse Merge. 


Procedure Coarse Merge (A, B, p)
/* Merge 2 sorted sequences A and B, each of size (1/2) m p, on a p PE
system. The first p/2 PEs contain elements of A, while the last p/2
contain elements of B. */


denote: the processor element i by PE_i, the data in PE_i by D_i, and
the minimum element of D_i by d_i.
define the following macros :
left(i)     = { -1 if i = 0 or p/2; i-1 otherwise }
right(i)    = { p if i = p/2-1 or p-1; i+1 otherwise }
border(i)   = { TRUE iff RIGHTRANK(i) > RANK(i) + 1 }
firstrun(i) = { ((i < p/2) AND (RANK(i) = i)) OR
                ((i >= p/2) AND (RANK(i) = i - p/2)) }  /* BOOLEAN */
The following integers are used by each PE_i:

RANK(i) : the rank of d_i following the fine-grained merge. Define
   RANK[-1] = -1 and RANK[p] = p.

RIGHTRANK(i) : the rank of d_right(i).

PARTNER(i) : the block with which D_i is paired.

RIGHTLIMIT(i) : = d_j, where RANK[j] = RANK[i]+1.

NEXT(i) : defined if (border(i)); = Id. of next ranked block.

BEGIN-SCATTER(i) : defined if (border(i)); = Id. of the first block to
   pair with D_i.

END-SCATTER(i) : defined if (border(i)); = Id. of the last block to
   pair with D_i.

alpha_i : Elements discarded due to condition 1.

COUNT(i) : Number left after duplicate removals.

BORDER-RANK(i) = number of PE_j, j < i, such that both j and i
   are from the same half of the PE array and border(j) is TRUE.


begin

1 (a). Each PE_i, 0 <= i <= p-1, creates a record X_i = (VAL, INDEX)
       with VAL = d_i and INDEX = i.

1 (b). The processors sort { X_i } using a fine-grained parallel
       computation. At the end of this step, each PE_i has RANK(i) and
       RIGHTLIMIT(i) correctly initialized.

2 (a). Every PE_i except PE_0 and PE_(p/2) creates the record
       R = (rightrank), with rightrank = RANK(i), and sends it to
       PE_(i-1).

2 (b). Every PE_i except PE_(p/2-1) and PE_(p-1) receives records R sent
       in step 2 (a). All PE_i perform the following assignments:
       1. if (right(i) != p) then RIGHTRANK(i) = R.rightrank
          else RIGHTRANK(i) = p.
       2. if (NOT firstrun(i)) then
             PARTNER(i) = RANK(i) - i + p/2 - 1.
       3. if (border(i)) then
             if (firstrun(i)) then NEXT(i) = RANK(i) - i + p/2
             else NEXT(i) = right(PARTNER(i))
             BEGIN-SCATTER(i) = NEXT(i)
             if (right(i) != p) then END-SCATTER(i) =
                NEXT(i) + RIGHTRANK(i) - RANK(i) - 2
             else if (i = p/2 - 1) then END-SCATTER(i) = p - 1
             else END-SCATTER(i) = p/2 - 1.

3.     Using FINE-GRAINED PREFIX, each PE_i such that border(i) is
       TRUE computes BORDER-RANK(i).

       /* Border PEs scatter data blocks */
4.     if (border(i)) then scatter(D_i, A_F(i), {A_R1(i), A_R2(i)})
       where A_F = BORDER-RANK(i), A_R1 = BEGIN-SCATTER(i), and
       A_R2 = END-SCATTER(i).


5 (a). Each PE_i, 0 <= i <= p-1, such that firstrun(i) = FALSE receives
       the block D_PARTNER(i) sent in step 4, merges the block D_i with
       the block D_PARTNER(i), discards duplicates (elements < d_i or
       >= RIGHTLIMIT(i)), and sets alpha_i.

5 (b). Each PE_i such that firstrun(i) is TRUE sets alpha_i = m.
       Let V_i denote the data that remains in PE_i. COUNT(i) = |V_i|.

       /* Permute data blocks into increasing order of rank */
6 (a). Each PE_i sends (V_i, COUNT(i), alpha_i) to PE_RANK(i).

6 (b). Each PE_i receives the data sent out from some PE_j (such that
       i = RANK(j)) in step 6 (a). Let U_i denote the data received by
       PE_i; i.e., U_i = V_j. PE_i sets alpha_i = alpha_j and
       COUNT(i) = COUNT(j).

       /* Equalize data in each PE */
7 (a). Each PE_i, 0 <= i <= p-1, sends the first m - alpha_i elements
       of U_i to PE_(i-1), and deletes them from U_i.

7 (b). Each PE_i, 0 <= i <= p-1, receives the elements sent in step
       7 (a), and concatenates them to U_i.

end
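
The assignments of step 2 (b) can be traced on the running example. The sketch below is an illustration under stated assumptions: `scatter_plan` is a hypothetical name, the representatives are assumed distinct, and an ordinary sort stands in for the fine-grained merge of step 1 (b). It returns PARTNER(i) for the non-first-run PEs and the (BEGIN-SCATTER, END-SCATTER) range for the border PEs.

```python
def scatter_plan(d):
    """d[i] is the minimum element held by PE i; PEs 0..p/2-1 hold A, the rest B."""
    p = len(d)
    half = p // 2
    rank = [0] * p
    for r, i in enumerate(sorted(range(p), key=lambda i: d[i])):
        rank[i] = r                                  # stand-in for step 1 (b)

    def right(i):
        return p if i in (half - 1, p - 1) else i + 1

    partner, scatter = {}, {}
    for i in range(p):
        rightrank = rank[right(i)] if right(i) != p else p
        border = rightrank > rank[i] + 1
        firstrun = rank[i] == (i if i < half else i - half)
        if not firstrun:
            partner[i] = rank[i] - i + half - 1
        if border:
            nxt = rank[i] - i + half if firstrun else right(partner[i])
            if right(i) != p:
                end = nxt + rightrank - rank[i] - 2
            elif i == half - 1:
                end = p - 1
            else:
                end = half - 1
            scatter[i] = (nxt, end)
    return partner, scatter
```

With `d = [1, 2, 6, 8, 3, 4, 5, 7]` the representatives merge in the order a0 a1 b0 b1 b2 a2 b3 a3 used in the text, and the plan sends the block of PE 1 to PEs 4 through 6, that of PE 2 to PE 7, that of PE 6 to PE 2 and that of PE 7 to PE 3.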


Steps 1 and 3 of Coarse Merge consist of fine-grained parallel
computations. By fine-grained computation we mean computations which
deal with sparse data sets, one data record per processor. In step 1 the
two sequences of representative elements $a_0 \ldots a_{p/2-1}$ and
$b_0 \ldots b_{p/2-1}$ are merged into the sorted sequence $\Gamma$.
This is most simply performed using the well known bitonic merge
algorithm to merge the records $X_i$, $0 \le i \le p/2-1$, and $X_i$,
$p/2 \le i \le p-1$. (The implementation is standard and details appear
in [22].) Following bitonic merge, PE[i], $0 \le i \le p-1$, contains
the record $X_j$ whose rank is equal to $i$. The rank and the value of
the next higher element are returned to PE[j]. At the end of step 1,
each PE[i] has its local variable RANK(i) set equal to the position of
its representative in $\Gamma$ and its variable RIGHTLIMIT(i) set to the
value of $t_{RANK(i)+1}$, the representative immediately to its right in
$\Gamma$. Figure 5(a) shows the situation assuming the sorted list of
representatives of Figure 4.


In step 2, each PE determines if its representative is a border
element, and if so determines the range of addresses over which it has
to broadcast its data block. PE[i], $i \ne p/2 - 1$ and $i \ne p - 1$,
is a border PE only if RANK(i+1) > RANK(i) + 1. (PE[p/2-1] and PE[p-1]
are handled specially.) In step 3 each border PE[i],
$0 \le i \le p-1$, determines BORDER-RANK(i), the number of border PEs
to its left and in its half of the processor array. This information is
used to route the data blocks in a conflict-free manner in step 4.


In step 4, the requisite data blocks are broadcast. The details are
discussed in section 3.1.1. With respect to the example (see figure
5(b)), PE[1] broadcasts block $A_1$ to PE[4], PE[5] and PE[6], $A_2$ is
broadcast to PE[7], $B_2$ to PE[2] and $B_3$ to PE[3]. In step 5,
processors merge the resident block with the block broadcast to it, and
remove duplicates to obtain the sorted blocks $V_i$,
$0 \le i \le p-1$. Note that the blocks will be in permuted order, the
permutation being determined by RANK(i) (see figure 5(c)). In step 6
the blocks are permuted into rank order. Finally in step 7, each PE[i]
transfers $m - \alpha_i$ elements to PE[i-1], to equalize the number of
elements in each PE.
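
Steps 4 through 7 can be followed end to end in a sequential simulation. The sketch below is illustrative, not the parallel procedure: `coarse_merge` is a hypothetical name, blocks are indexed directly by rank (so the permutation of step 6 is implicit), all elements are assumed distinct, and both inputs are assumed to have a length that is a multiple of m.

```python
from heapq import merge
import math


def coarse_merge(A, B, m):
    """Merge sorted lists A and B (same length) into sorted blocks of size m."""
    p = len(A) // m
    blocks = [A[i * m:(i + 1) * m] for i in range(p)] + \
             [B[i * m:(i + 1) * m] for i in range(p)]
    kinds = ['A'] * p + ['B'] * p
    order = sorted(range(2 * p), key=lambda b: blocks[b][0])   # the sequence Gamma
    T = [blocks[b] for b in order]
    K = [kinds[b] for b in order]
    t = [blk[0] for blk in T] + [math.inf]                     # t_(2p) = +infinity

    U, alpha, last_border = [], [], -1
    for i in range(2 * p):
        other = T[last_border] if last_border >= 0 else []     # T_partner(i)
        # Condition (i) removes the partner's elements below t_i.
        alpha.append(m if last_border < 0 else sum(x < t[i] for x in other))
        # Keep only the elements in [t_i, t_(i+1)).
        U.append([x for x in merge(T[i], other) if t[i] <= x < t[i + 1]])
        if i + 1 < 2 * p and K[i] != K[i + 1]:
            last_border = i
    alpha.append(m)                                            # nothing taken from T_(2p)

    # Step 7: drop the first m - alpha_i elements of U_i, and append the
    # first m - alpha_(i+1) elements of U_(i+1).
    C = []
    for i in range(2 * p):
        tail = U[i + 1][:m - alpha[i + 1]] if i + 1 < 2 * p else []
        C.append(U[i][m - alpha[i]:] + tail)
    return C
```

For `A = [1, 5, 6, 10]`, `B = [2, 3, 4, 7]` and `m = 2`, the intermediate blocks are `U = [[1], [2, 3], [4, 5], [6, 7, 10]]` with `alpha = [2, 1, 1, 1]`, and the shift by one equalizes them to two elements each.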


The FINE-GRAINED-PREFIX operation of step 3 of Coarse Merge is
implemented as follows. Each PE[i] such that border(i) is true sets a
flag, Flag(i), to one; the other PEs set Flag(i) to zero. Each half of
the processor array performs a parallel prefix computation [15] (also
referred to as partial sums computation), so that each PE obtains the
sum of the flag values of the PEs that lie in its half and have indices
smaller than its own. (The details of the implementation on a hypercube
appear in [22].)
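
The quantity computed by this step is an exclusive prefix sum of the border flags, taken separately over each half of the array. A minimal sequential sketch follows; `border_ranks` is a hypothetical name, and the code shows only the result of the computation, not its hypercube implementation.

```python
from itertools import accumulate


def border_ranks(border):
    """border[i] is True when PE i is a border PE; returns BORDER-RANK for every PE."""
    half = len(border) // 2
    ranks = []
    for lo in (0, half):                             # each half is handled separately
        flags = [int(f) for f in border[lo:lo + half]]
        inclusive = accumulate(flags)                # running sums including own flag
        ranks.extend(total - f for total, f in zip(inclusive, flags))
    return ranks
```

In the running example the border PEs are 1, 2, 6 and 7, so PEs 1 and 6 obtain BORDER-RANK 0 and PEs 2 and 7 obtain BORDER-RANK 1. Only the values at border PEs are used by step 4.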


3.1.1. Mapping on the Network 


In this section we discuss how the various steps of procedure 
Coarse Merge can be implemented on the pipelined hypercube. Step 5 
consists of a sequential merging algorithm executed concurrently and 
independently by each PE. Transfers of data blocks are required in steps 
4,6 and 7. We discuss each of these now, starting with the simplest. 


Step 7 consists of a Left-Shift by one. By Lemma 2.2 (or 2.3), this
can be accomplished in a conflict-free manner by using LSB routing on
the forward network (or MSB routing on the reverse network).


The data transfer required by step 6 is a Weave routing. It is
therefore performed in two stages. The first stage is a Right-Transfer
where all blocks $A_i$ are transferred from PE[i],
$0 \le i \le p/2 - 1$, to PE[RANK(i)]. The second stage is a
Left-Transfer where all blocks $B_i$ are transferred from PE[i + p/2]
to PE[RANK(i)]. Note that for each pair $A_i$, $A_j$ (and for each pair
$B_i$, $B_j$), $j > i$, RANK[j] - RANK[i] $\ge j - i$. Hence, in both
cases the conditions for conflict-free routing are satisfied. Each stage
can be accomplished using MSB routing on the reverse network.


Data routing required in step 4 is also done in two stages. In the
first stage, all blocks from a border PE[i] are transferred to
PE[BORDER-RANK(i)]. This is a Concentrate operation performed
independently on each half of the processor array and is conflict-free
under LSB routing on the forward network (Lemma 2.2). In the second
stage, the concentrated blocks are broadcast from PE[BORDER-RANK(i)] to
PEs in the range PE[BEGIN-SCATTER(i)] to PE[END-SCATTER(i)]. Note that
both the broadcast operations can occur concurrently.


The remaining steps can be performed directly using the cube routing
portion of the network. At the end of step 1, ranks of $X_i$ are
returned to PE[i], which is exactly the reverse of that occurring in
step 5, except that only 1 record per PE is involved. This can be
accomplished in a 2 step conflict-free manner: first all PE[i]
containing representatives from A send their records using LSB routing;
next those containing records from B do the same. The time required is
upper-bounded by $2 \log p$ routing steps.


3.2. Complexity Analysis


The time required for procedure Coarse Merge can be upper bounded as
follows. Note that the algorithm uses $p$ PEs and each PE has $m$ data
items. The time required for the fine-grained computations in steps 1, 2
and 3 is $O(\log p)$. The data routing required in steps 4, 6 and 7 can
be performed in $O(m + \log p)$ time. The sequential merging of two
blocks of size $m$ each in step 5 requires $O(m)$ time. Thus the time
for procedure Coarse Merge is given by:

$T_{merge}(m, p) = O(m + \log p)$.

The time required for procedure Parallel Merge Sort, $T_{sort}(N, P)$,
is derived as follows. Note that the procedure sorts a list of $N$ items
using $P = 2^K$ PEs. The time required to perform the independent sort
in each PE, using sequential mergesort (for instance), is
$O((N/P) \log (N/P))$. The time for the $j$th iteration,
$1 \le j \le K$, is $T_{merge}(N/P, 2^j)$. Hence:

$T_{sort}(N, P) \le c_1 \left( N/P + (N/P) \log (N/P) \right)
   + c_2 \sum_{j=1}^{K} \left( N/P + j \right)$

which is $O((N \log N)/P + \log^2 P)$.
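
The last step can be expanded as a worked calculation. The derivation below is a sketch that assumes the bound has the form of a local-sort term plus a sum of per-iteration merge costs $N/P + j$, with $K = \log P$ and constants $c_1$, $c_2$; it also spells out why the speedup is linear for $P \le N / \log N$.

```latex
\begin{align*}
\sum_{j=1}^{K} \left( \frac{N}{P} + j \right)
   &= \frac{N}{P} K + \frac{K(K+1)}{2}
    = \frac{N}{P} \log P + O(\log^2 P), \\
T_{sort}(N, P)
   &= O\!\left( \frac{N}{P} \log \frac{N}{P} + \frac{N}{P} \log P + \log^2 P \right)
    = O\!\left( \frac{N \log N}{P} + \log^2 P \right).
\end{align*}
% Linear speedup: if P \le N / \log N then
%   (N \log N) / P \ge \log^2 N \ge \log^2 P,
% so the first term dominates and T_sort = O((N \log N) / P).
```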
Concurrent Insertions and Deletions in a Priority Queue


V. Nageshwara Rao and Vipin Kumar
Department of Computer Sciences,
Austin, Texas 78712


Abstract 


The heap is an important data structure used 
as a priority queue in a wide variety of parallel al- 
gorithms (e.g., multiprocessor scheduling, branch- 
and-bound). In these algorithms, contention for 
the shared heap limits the obtainable speedup. This 
paper presents an approach to allow concurrent in- 
sertions and deletions on the heap. Our scheme 
has much lower overheads and gives a much better 
performance than a previously reported scheme. 
The scheme also retains the strict priority ordering 
of the serial access heap algorithms; i.e., a delete 
operation returns the best element of all elements 
that have been inserted or are being inserted. Our 
experimental results on the BBN Butterfly parallel 
processor demonstrate that the use of concurrent 
heap algorithms in parallel branch-and-bound im- 
proves its performance substantially. 


1 Introduction 


The heap is an important data structure used as a priority 
queue in a wide variety of parallel algorithms (e.g., multipro- 
cessor scheduling, branch-and-bound[10]). In these algorithms 
each processor performs an access-think cycle. Every proces- 
sor executes its current subproblem at hand (thinking), then 


accesses the shared heap to insert subproblems if it generated 
any and takes the best available subproblem in the heap to solve 
next. Since many processors are sharing the heap and they may 
access the heap at the same time, the simplest way to provide 
consistency in updates is to serialize the updates. A lock is 
associated with the heap and the processors access the heap 
under mutual exclusion. This serial access scheme limits the 
number of processors that can be used to speed up the problem. 
If $T_{think}$ is the mean think time and $T_{access}$ is the mean access 
time, then clearly the maximum speedup achievable is bounded by 

$$\frac{T_{access} + T_{think}}{T_{access}}$$

$T_{think}$ is a characteristic of the problem being solved. $T_{access}$ 
depends on the priority structure being used. For the heap, 
$T_{access}$ is O(log N), where N is the size of the heap. 
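
As a quick numeric illustration of this bound, the sketch below evaluates it for a think time five times the access time, the ratio used later in the experiments. The function name `max_speedup` is an assumption introduced here for illustration only.

```python
def max_speedup(t_think: float, t_access: float) -> float:
    """Upper bound on speedup when all heap accesses are serialized.

    Each access-think cycle costs t_access + t_think, and at most one
    processor can be inside the heap at a time.
    """
    if t_access <= 0:
        raise ValueError("t_access must be positive")
    return (t_access + t_think) / t_access


# Think time five times the access time caps the speedup at 6.
print(max_speedup(t_think=5.0, t_access=1.0))  # prints 6.0
```

Under this assumption no more than about six processors can be kept busy by a serially accessed heap, however many are available.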

(...fice of Naval Research Grant N00014-86-K-0763 to the computer sci-...) 

One way to alleviate the limitation is to let many processors 
access the heap simultaneously. Updates on different parts of 
a heap can proceed concurrently provided they do not interact 
with each other. Let us view the heap as a binary tree with 
the root at the top and leaves at the bottom. In the ordinary 
serial heap algorithm, the deletes manipulate the heap level by 
level going from top to bottom, while inserts manipulate it from 
bottom to top. Hence many insertions (or many deletions) can 
be executed in parallel by using a simple locking scheme[1]. 
But inserts and deletes cannot be active together, as they pro- 
ceed in opposite directions and hence can deadlock. Biswas and 
Browne[1] present a scheme to handle this problem. But their 
scheme has a substantial overhead, and performs worse than 
the sequential heap unless the heap size N is very large. 

This paper presents a new concurrent heap access scheme 
that has small overhead, and is able to perform better than the 
sequential heap even for small heaps. Two important ingre- 
dients of this scheme are (i) a heap insertion algorithm which 
manipulates the heap from top to bottom; and (ii) a scheme to 
combine a delete operation with the most recent unfinished in- 
sertion operation. Since these new insertions and the deletions 
move from top to bottom in the heap, they can both be active 
together without causing deadlock. 


2 Preliminaries 


A heap is a complete binary tree of depth d[5] with the property 
that the value of the key at any node is less than the value of 
the keys at its children (if they exist). 

It is efficient to implement the heap using an array. The 
root occupies location 1 and the node i occupies location i. The 
children of node i occupy locations 2i and 2i+1. The parent of 
node i is at ⌊i/2⌋. We assume that each node in the heap has a 
key pointing to a field of data. Key(i) denotes the key located 
at node i. VALUE(i) denotes the value or priority order of the 
key at node i. Empty nodes in the heap are assumed to have 
keys with value MAXINT (= ∞). 

We denote the left son and the right son of node i by 
LSON(i) and RSON(i) respectively. The parent of node i is de- 
noted by PARENT(i). Associated with the heap are the data 
fields lastelem and fulllevel¹. lastelem is the index of the last non- 
empty node of the heap. The keys of all nodes beyond lastelem 
are MAXINT. fulllevel is the index of the first node in the deepest 
level of the heap (that contains at least one non-empty node). 
For an empty heap, lastelem = fulllevel = 0. Fig 1 shows a 
sample heap of twelve elements, and the value of lastelem and 
fulllevel. 
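
The array layout and the two bookkeeping fields can be sketched as follows. This is an illustrative sketch only: the helper names `parent`, `lson`, `rson` and `full_level` are assumptions, and `full_level` recomputes fulllevel from lastelem, whereas the algorithms in this paper maintain it incrementally.

```python
def parent(i: int) -> int:
    """Index of the parent of node i (1-based array layout)."""
    return i // 2


def lson(i: int) -> int:
    """Index of the left son of node i."""
    return 2 * i


def rson(i: int) -> int:
    """Index of the right son of node i."""
    return 2 * i + 1


def full_level(lastelem: int) -> int:
    """First node index on the deepest non-empty level (0 for an empty heap)."""
    if lastelem == 0:
        return 0
    # Largest power of two that does not exceed lastelem.
    return 1 << (lastelem.bit_length() - 1)


# A twelve-element heap: lastelem = 12, deepest level starts at node 8.
print(full_level(12))  # prints 8
```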


¹The conventional delete and insert operations[5] do not need to 
maintain fulllevel, but it is needed for the insert_t operation, which 
traverses the heap from top to bottom. 


The operations supported on a heap are insertion and dele- 
tion. The insert operation inserts a new key, nkey, in the heap 
and the delete operation returns the smallest key in the heap. 
The reader is referred to [5] for the details of these serial-access 
heap algorithms. 


3 Inserting from Top 


It is possible to perform insertions from the top by using the 
following (informally stated) algorithm²: 


k ← 1; 
if VALUE(k) > VALUE(nkey) 
    then Exchange(key(k), nkey) endif 
while (k has both successors) 
    k ← any successor of k; 
    if VALUE(k) > VALUE(nkey) 
        then Exchange(key(k), nkey) endif 
endwhile 
Put nkey at one of the empty leaves of k. 


This naive insertion algorithm is not guaranteed to grow 
the heap level-by-level, which is crucial for the efficiency of in- 
sertions and deletions.³ Our new insertion algorithm, which we 
call insert_t, performs reheapification in such a way that each 
insertion adds a key to the first empty node in the heap (just 
as in the conventional insert operation). 

Let target be the first empty node in the heap. The in- 
sertion path is the path between the root and target. This 
path is unique (and can be easily computed) because the heap 
has a tree structure. Values of the nodes on the insertion path 
(from root to target) are nondecreasing. To insert a new key in 
the heap, we need to put the new key at a proper node on the 
insertion path, and move all the keys at and below this node 
one level down (filling the target node). The conventional in- 
sert algorithm does this by visiting the nodes on the insert path 
from bottom to top. The insert_t algorithm given below does 
it in the opposite order. 


insert_t(nkey, heap) 
    Lock(heap); 
    lastelem ← lastelem + 1; 
    target ← lastelem; 
    if (lastelem ≥ fulllevel × 2) 
        then fulllevel ← lastelem endif 
    i ← target - fulllevel;   /* i is the displacement of target */ 
    j ← fulllevel/2;          /* j = 2^(length of insertion path - 1) */ 
    k ← 1;                    /* k is the current position in the insertion path */ 

    /* Reheapification Loop */ 
    while (j ≠ 0) 
        if (VALUE(k) > VALUE(nkey)) 
            then Exchange(nkey, key(k)) endif 
        if (i ≥ j) 
            then {k ← RSON(k); i ← i - j}   /* Go Right */ 
            else {k ← LSON(k)}              /* Go Left */ 
        endif 
        j ← j/2; 
    endwhile 
    key(k) ← nkey; 
    Unlock(heap); 
end_insert_t 
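
A serial Python rendering of insert_t may help in checking the index arithmetic. It is a sketch only: the class name `TopDownHeap` is an assumption, the whole-heap lock is omitted, and the two comparisons are written as "greater than or equal", which is what the thirteen-element example of Fig 2 requires.

```python
import math

MAXINT = math.inf  # value stored in empty nodes


class TopDownHeap:
    """Array heap (1-based) with a top-down insert; no locking."""

    def __init__(self, capacity: int):
        self.key = [MAXINT] * (capacity + 1)  # slot 0 is unused
        self.lastelem = 0
        self.fulllevel = 0

    def insert_t(self, nkey):
        self.lastelem += 1
        target = self.lastelem
        if self.lastelem >= 2 * self.fulllevel:
            self.fulllevel = self.lastelem    # a new level has been opened
        i = target - self.fulllevel           # displacement of target in its level
        j = self.fulllevel // 2               # weight of the next path bit
        k = 1                                 # current node on the insertion path
        while j != 0:
            if self.key[k] > nkey:
                # Leave the smaller key here, carry the larger one down.
                self.key[k], nkey = nkey, self.key[k]
            if i >= j:
                k = 2 * k + 1                 # go right
                i -= j
            else:
                k = 2 * k                     # go left
            j //= 2
        self.key[k] = nkey                    # k == target at this point
        return k


h = TopDownHeap(8)
for v in [5, 3, 8, 1]:
    h.insert_t(v)
print(h.key[1:5])  # prints [1, 3, 8, 5]
```

Each call fills exactly the first empty node, so the heap grows level by level just as with the conventional bottom-up insert.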




Status code | Meaning 
------------+----------------------------------------------------- 
PRESENT     | A key exists at the node. 
PENDING     | An insertion is currently in progress which will 
            | ultimately insert a key at the node. 
WANTED      | A deleter is waiting for the key. 
ABSENT      | No key is present at the node. 

Table 1: Meaning of various status codes. 


Note that the insertion path is being computed on the fly. 
Let i be the displacement of target at the last level (i.e., i = 
lastelem - fulllevel), and p be the length of the insertion path. 
If we view i as a p bit binary number, then the bits of the 
binary representation of i (from the most significant to the least 
significant) tell us whether to go right (if 1) or left (if 0) when we 
go from the root downward. For example, the first element at 
the last level (given by fulllevel) has displacement 0 and its path 
is left left left .... Fig 2 shows the twelve element heap of Fig 1 
to which a thirteenth element is being added. It also shows the 
values of fulllevel, lastelem, and i just before the execution of 
the reheapification loop in insert_t. For a proof of correctness 
of insert_t, see [9]. 
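
The bit-by-bit reading of the displacement can be tried out in isolation. The sketch below is illustrative; the function name `insertion_path` and the 'L'/'R' encoding are assumptions made here.

```python
def insertion_path(target: int) -> list:
    """Root-to-target moves ('L' or 'R') for a 1-based heap index."""
    fulllevel = 1 << (target.bit_length() - 1)  # first node of target's level
    i = target - fulllevel                      # displacement within the level
    j = fulllevel // 2                          # weight of the most significant bit
    moves = []
    while j != 0:
        if i >= j:          # current bit of i is 1
            moves.append("R")
            i -= j
        else:               # current bit of i is 0
            moves.append("L")
        j //= 2
    return moves


# Node 13: displacement 5 = (101), so the path is right, left, right.
print(insertion_path(13))  # prints ['R', 'L', 'R']
```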


4 Concurrent access algorithms 


A simple locking strategy is embedded into the delete and insert_t 
routines to achieve concurrent access while maintaining consis- 
tency in updates and avoiding deadlocks. Instead of locking 
the whole heap (as done in the serial access scheme), we lock 
only a small portion of the heap at a time. This portion is 
called a window. It consists of 3 nodes for the delete algorithm 
and 1 node for the insert operation. In order to allow window 
locking, we associate a lock with every element. Each processor 
accesses the contents of a node only after locking it to ensure 
mutual exclusion. The two other data fields of the heap, ful- 
llevel and lastelem, are modified only in the initialization phase 
of the insert_t and delete routines. Hence we associate the lock 
of node 1, the root, with these fields also; i.e., a processor can 
access these locations only when the root has been locked. 

Although insert_t and delete both manipulate the heap from 
top to bottom, there is one problem in letting them work to- 
gether. Recall that the delete operation deletes the key at the 
root and replaces it with the most recently inserted leaf key (and 
starts reheapification). If the last insert_t operation is still in 
progress, then this last leaf node does not have a key. If delete 
picks up the key of any other leaf node, then the resulting heap 
may become unbalanced. If the delete operation waits for the 
last insertion to finish, then we lose concurrency. 


²Throughout the paper we present algorithms in a C-like English 
pseudo-code. 
³If the heap becomes unbalanced, then inserts and deletes can take 
up to O(N) operations rather than O(log N) operations. 


To solve this problem, we associate a field called status with 
every node in the heap. The status of a node can have four val- 
ues, each associated with the semantics given in Table 1. When 
an insertion starts, the status of its target is set to pending. If 
a deleter starts working when an insertion is still in progress, it 
changes the status of the target of the last inserter to wanted, 
and waits. After every step of reheapification on the insertion 
path, the inserter checks to see if the status of target has be- 
come wanted. If this is the case, then nkey is placed at the 
root and the inserter quits. Once the key is placed at the root, 
the deleter starts working. The concurrent deletion and insertion 
algorithms are presented below. 


Concurrent_Delete(heap) 
    Lock(1);                     /* Lock the root of the heap */ 
    if (lastelem = 0) 
        then {Unlock(1); Return(NULL)} endif 
    least ← key(1); 
    i ← 1; 
    j ← lastelem; 
    lastelem ← lastelem - 1; 
    if (lastelem < fulllevel) 
        then fulllevel ← fulllevel/2 endif 
    if (j = 1) 
        then {key(1) ← MAXINT; status(1) ← ABSENT; 
              Unlock(1); Return(least)} endif 
    Lock(j); 
    if (status(j) = PRESENT) 
        then {key(1) ← key(j); status(j) ← ABSENT; 
              key(j) ← MAXINT} 
        else {status(1) ← ABSENT; status(j) ← WANTED} 
    endif 
    Unlock(j); 
    while (status(i) = ABSENT) do {Wait()} 
    endwhile                     /* i = 1 at this point */ 

    Lock(LSON(i)); Lock(RSON(i)); 
    /* Reheapification Loop */ 
    /* Let MIN(i) give index of the son of i which has lower VALUE */ 
    /* Let MAX(i) give index of the son of i which has higher VALUE */ 
    while (VALUE(i) > VALUE(MIN(i))) do 
        Exchange(key(i), key(MIN(i))); 
        Unlock(i); Unlock(MAX(i)); 
        i ← MIN(i); 
        Lock(LSON(i)); Lock(RSON(i)); 
    endwhile 
    Unlock(i); Unlock(LSON(i)); Unlock(RSON(i)); 
    Return(least); 
end_Concurrent_Delete 


Concurrent_Insert(nkey, heap) 
    Lock(1);                     /* Lock root of the heap */ 
    lastelem ← lastelem + 1; 
    target ← lastelem; 
    if (lastelem ≥ fulllevel × 2) 
        then fulllevel ← lastelem endif 
    i ← target - fulllevel;      /* i is the displacement of target */ 
    j ← fulllevel/2;             /* j = 2^(length of insertion path - 1) */ 
    k ← 1;                       /* k is the current position in the insertion path */ 
    status(target) ← PENDING; 

    /* Reheapification Loop */ 
    while (j ≠ 0) 
        if (status(target) = WANTED) 
            then break endif 
        if (VALUE(k) > VALUE(nkey)) 
            then Exchange(nkey, key(k)) endif 
        if (i ≥ j) 
            then /* Go Right */ 
                {Lock(RSON(k)); Unlock(k); 
                 k ← RSON(k); i ← i - j} 
            else /* Go Left */ 
                {Lock(LSON(k)); Unlock(k); 
                 k ← LSON(k)} 
        endif 
        j ← j/2; 
    endwhile 
    if (status(target) = WANTED) 
        then /* Some deleter is waiting at the root 
                to pick the key at target */ 
            {key(1) ← nkey; status(target) ← ABSENT; 
             status(1) ← PRESENT} 
        else 
            {key(target) ← nkey; 
             status(target) ← PRESENT} 
    endif 
    Unlock(k); 
end_Concurrent_Insert 


Whenever an inserter or a deleter moves down 1 level by 
incrementing k or i, it first locks the next node and then releases 
the current lock. This ensures that concurrent deletes or inserts 
proceeding in the same path progress in a strict queue order 
without any interference. Since the locking sequence is in the 
strict increasing order of node indices, there are no deadlocks. 
See [9] for a proof of correctness. 
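
The lock-then-release discipline described above (often called lock coupling) can be illustrated on its own. The sketch below is not the concurrent heap itself: it only walks root-to-leaf paths with one lock per node, counting visits under the lock. The class `LockedPathTree` and the function `run_demo` are names assumed here for illustration.

```python
import threading


class LockedPathTree:
    """One lock per node of a 1-based array tree; descents use lock coupling."""

    def __init__(self, size: int):
        self.locks = [threading.Lock() for _ in range(size + 1)]
        self.visits = [0] * (size + 1)

    def descend(self, target: int) -> None:
        # The root-to-target path consists of the ancestors of target.
        path = []
        node = target
        while node >= 1:
            path.append(node)
            node //= 2
        path.reverse()

        k = path[0]                        # always the root, node 1
        self.locks[k].acquire()
        for nxt in path[1:]:
            self.visits[k] += 1            # work on k while holding its lock
            self.locks[nxt].acquire()      # lock the next node first ...
            self.locks[k].release()        # ... then release the current one
            k = nxt
        self.visits[k] += 1
        self.locks[k].release()


def run_demo(num_threads: int = 8, rounds: int = 200, size: int = 15):
    tree = LockedPathTree(size)

    def worker(seed: int) -> None:
        for r in range(rounds):
            tree.descend(8 + (seed + r) % 8)   # leaves 8..15

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tree.visits
```

Because every thread acquires locks in increasing node-index order and holds at most two at a time, the descents cannot deadlock, and threads on the same path follow one another in queue order.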


5 Experimental Evaluation 


We have implemented the concurrent access heap algorithms 
and the serial access heap algorithms on the BBN Butterfly to test 
their performance. Using each scheme, P processors performed 
a total of 1000 delete or insert operations (each processor per- 
formed 1000/P operations; P was varied between 1 and 30). 
Think time $T_{think}$ was set to roughly 5 times the heap access 
time of the serial operation. The speedup was computed as 
follows: 

$$\text{Speedup} = \frac{\text{Time taken by 1 processor for 1000 operations}}{\text{Time taken by P processors for 1000 operations}}$$


Relative performance of the concurrent heap was studied 
for the following cases. 


Case I: Deletes 
In this case each processor performed one delete operation, in 
each access-think cycle. A total of 1000 delete operations were 
performed on a heap that initially had 2048 elements. Thus, 
the depth of the heap remained 10 for all the deletions. For 
the serial access scheme, the speedup was fairly linear up to 5 
processors, but saturated after that. For the concurrent heap, 
the speedup did not saturate until over 11 processors. These 
results are shown in Fig. 3. 

Case II: Inserts 


In this case, each processor performed one insert operation 
in each access-think cycle. A total of 1000 insert operations 
were performed on a heap that initially had 1024 elements. 
Thus, the depth of the heap remained 10 for all the insertions. 
If the values of the inserted keys are randomly distributed, then 
the number of iterations of the heapification loop executed by 
insert_b (the conventional bottom-up insert) is very small[5]. On the other hand, insert_t executes 
strictly log(N - 1) iterations (N is the size of the heap). Hence 
for inserting keys with random key values, our concurrent heap 
scheme does not perform better than the serial access heap in- 
sert. (The speedup figures are roughly the same for both cases.) 

In parallel branch-and-bound algorithms[8,10] the inserted 
keys tend to have small values. For such keys, both insert_b and 
insert_t would execute roughly the same number of iterations. 
To test the performance in this case, we generated keys whose 
values were in decreasing order. In this case, just as in 
Case I, the speedup of the concurrent heap scheme saturated 
much later than the serial access heap scheme. Fig. 4 shows 
the speedup curves. 

Case III: Two Inserts and One Delete 


In this case, each processor performs one delete and two 
insert operations in each access-think cycle. The heap initially 
has 1024 elements. This case simulates the behavior of a typical 
parallel branch-and-bound algorithm in which each processor 
picks a least cost node from the heap, generates 2 successors 
and puts them back on the heap. In this case, the concurrent 
heap scheme is able to provide a speedup of 13.5, whereas the 
serial access scheme saturates at 5. 


6 Related Research 


Biswas and Browne[1] present a scheme, called CHEAP, that 
allows insertions and deletions to proceed in parallel. In their 
scheme, an insert or delete operation is decomposed into a se- 
quence of update steps at different levels of a heap. An auxiliary 
task queue stores the steps of insertions and deletions currently 
in progress. By appropriately scheduling these update steps, 


a set of service processes concurrently perform insertions and 
deletions without causing deadlocks. If enough service proces- 
sors are available, then this scheme can perform insertions and 
deletions in constant time. This approach is not able to per- 
form better than the serial access scheme except for very large 
heaps due to the overheads associated with scheduling window 
updates through the server queue. 

Unlike the scheme in [1], our scheme does not require spe- 
cial server processors to update the heap. Also the number 
of locks needed for each operation is much smaller. Unlike 
their scheme, our scheme also retains the strict priority order- 
ing of the serial access heap algorithms; i.e., a delete opera- 
tion returns the best element of all elements that have been 
inserted or are being inserted at the time the delete operation 
is started. The scheme presented in this paper was motivated 
by the work of Biswas and Browne. Initially, we wanted to in- 
corporate CHEAP in our parallel branch-and-bound algorithms 
to improve their performance. But experiments conducted by 
Biswas⁴ showed that CHEAP was not able to perform better than 
the serial access scheme even for heaps with 1,000 elements. 

Ellis and Gaffar⁵ have developed a scheme that also does 
not require the use of separate special service processors. In 
this scheme, inserts and deletes proceed in opposite directions, 
but avoid deadlock using a “sliding-lock” scheme. Performance 
results of this scheme are not yet available. 

A number of concurrent-access schemes have been devel- 
oped for manipulating dictionaries that are represented as bal- 
anced trees[4,7], B-trees[3], and the balanced cube[2]. Most 
of these concurrent schemes allow O(log N) operations (delete 
the smallest key, delete a key, insert a key, search for a key, etc.) 
to be done simultaneously. A major exception is the balanced 
cube, which permits O(N) search, insert and delete operations 
to be done concurrently. However, even the balanced cube permits 
only O(log N) “delete-the-smallest-key” operations 
at a time. In a priority queue, the only operations of interest 
are “delete-the-smallest-key” and “insert-a-key”. For these op- 
erations, on a sequential processor, the heap is clearly a more 
efficient data structure than B-trees, balanced trees and the bal- 
anced cube. Since our concurrent-access heap scheme has the 
same degree of concurrency as others and has smaller overhead, 
it is better than other concurrent schemes for manipulating a 
strict priority queue. 


⁴Personal communication 
⁵Private communication with Carla Ellis 


7 Conclusions 


We have presented a new concurrent heap access scheme that 
has small overhead, and is able to perform better than the se- 
quential heap even for small heaps. The insert and delete op- 
erations of this scheme keep the heap balanced; hence each 
operation still takes O(log N) steps, where N is the size of the 
heap. The scheme also retains the strict priority ordering of 
the serial access heap algorithms; i.e., a delete operation re- 
turns the best key of all keys that have been inserted or are 
being inserted at the time delete is started. In this scheme, 
O(log N) processes can manipulate the heap simultaneously. A 
detailed analysis of the expected performance is reported in [9], 
where we also discuss a number of possible improvements that 
can be made to reduce the overhead of the scheme. We have 
incorporated the concurrent heap scheme in a parallel branch- 
and-bound algorithm for solving the traveling salesman prob- 
lem, and have obtained much better speedups than with the 
serial access schemes[6]. 

Note that even in the concurrent-access heap scheme, at 
most O(log N) processors can manipulate the heap concurrently. 
To allow greater concurrency, it seems necessary to relax the 
strictness of the priority queue. In [6], we present several “dis- 
tributed” formulations of priority queue that permit O(N) con- 
currency, and test their effectiveness in parallel branch-and- 
bound. 


Acknowledgements: We would like to thank Jit Biswas for 
many useful discussions concerning CHEAP. 
Figure 3: Plot of speedups obtained in execution of ac- 

Figure 4: Plot of speedups obtained in execution of ac- 
cess-think cycles for insert operation. The numbers in- 
serted are in a decreasing order. 

Figure 2: An example of how the insertion path is com- 
puted in insert_t. A new key is inserted into the heap 
at node 13. i = 5 = (101) in the binary representation; 
length of the insertion path = 3. 
CONVOLUTION ON SIMD MESH CONNECTED MULTICOMPUTERS 

Abstract 

Convolution is an important primitive in computer vision and 
image processing. In this paper, we present efficient and optimal 
algorithms for convolution on a mesh connected computer. Our 
algorithms do not assume any broadcast feature for data values as 
assumed by previously proposed algorithms. 
1. INTRODUCTION 

The inputs to the image template matching problem are an 
N × N image matrix I[0..N-1, 0..N-1] and an M × M template 
T[0..M-1, 0..M-1]. The output is an N × N matrix C2D where 

$$C2D[i, j] = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} I[(i+u) \bmod N, (j+v) \bmod N] * T[u, v], \quad 0 \le i, j < N$$

C2D is called the two dimensional convolution of I and T. Tem- 
plate matching, i.e., computing C2D, is a fundamental operation 
in computer vision and image processing. It is often used for edge 
and object detection; filtering; and image registration [ROSE82, 
BALL85]. Because of the fundamental nature of this problem and 
because of its high complexity (O(M²N²) on a single processor com- 
puter), much attention has been devoted to the development of 
efficient fine grain multicomputer parallel algorithms. For exam- 
ple, Chang, Ibarra, Pong and Sohn [CHAN87] have studied this 
problem on an SIMD pyramid computer; Fang, Li and Ni 
[FANG85], Fang, Li and Ni [FANG86], Prassana Kumar and 
Krishnan [PRAS86], and Ranka and Sahni [RANK87a] and 
[RANK87b] have considered it on a hypercube multicomputer; 
Fang, Li and Ni [FANG86] have considered perfect shuffle multi- 
computers; Kung and Song [KUNG81] have considered systolic 
arrays; and Lee and Aggarwal [LEE87], and Maresca and Li 
[MARE86] have considered mesh connected computers. 

Figure 1: An SIMD multicomputer (control unit, program memory, and processing elements each with its own memory) 


In this paper, a parallel algorithm for 2-D convolution is 
presented for an SIMD mesh connected multicomputer. Our algo- 
rithm differs from those of [LEE87] and [MARE86] in that our algo- 
rithm does not use any broadcast of data values. Further, the 
amount of result value movement in our algorithms is an order of 
magnitude less than in the algorithms of [LEE87] and [MARE86]. 
Thus when the size of the image and template values is small (e.g., 
binary images and templates) as compared to the convolution 
values, our algorithms will be more efficient. 
Section 2 describes our computer model. In addition, notation 
and some fundamental data movement operations are developed in 
this section. In Section 3, we develop fine grained algorithms for 
one dimensional convolution. These form the basic component of 
our two dimensional convolution algorithms which are developed in 
Section 4. 

This research was supported in part by the National Science Foundation 
under grants DCR84-20935 and MIP 86-17374. 

Figure 2: A 4 × 4 mesh connected computer 


2. PRELIMINARIES 

2.1. Mesh Connected Multicomputer 

A block diagram of an SIMD mesh connected multicomputer 
is given in Figures 1 and 2. The important features of such a mul- 
ticomputer and the programming notation we use are: 


1. There are P × P processing elements connected together via 
a mesh interconnection network (to be described later). Each 
PE has a unique index (0..P-1, 0..P-1). Sometimes we will use 
a one dimensional indexing of the mesh. This is obtained 
using the standard row major mapping in which (i, j) is 
mapped to iP + j. The local memory of each PE can hold 
data only (i.e., no executable instructions). Hence PEs need be 
able to perform only the basic arithmetic operations (i.e., no 
instruction fetch or decode is needed). Throughout this 
paper, we shall use brackets ('[ ]') to index an array and 
parentheses ('()') to index the PEs. So, A[i, j] refers to the (i, j)'th 
element of the matrix A while A(i, j) refers to the A register 
of PE(i, j). 


2. There is a separate program memory and control unit. The 
control unit performs instruction sequencing, fetching, and 
decoding. In addition, instructions and masks are broadcast 
by the control unit to the PEs for execution. An 
instruction mask is a boolean function used to select certain 
PEs to execute an instruction. For example, in the instruction 

A(i, j) := A(i, j) + 1,  (i mod 4 = 0) 

(i mod 4 = 0) is a mask that selects only those PEs whose 
row index satisfies this property. I.e., all PEs with row indices 
which are multiples of 4 increment their A register by 1. 


3. The topology of a 16 node mesh connected computer is shown 
in Figure 2. A P × P mesh contains P² PEs. PE(i, j) is con- 
nected to PE((i-1) mod P, j), PE((i+1) mod P, j), 
PE(i, (j-1) mod P) and PE(i, (j+1) mod P). 

4. Interprocessor assignments are denoted using the symbol ←, 
while intraprocessor assignments are denoted using the sym- 
bol :=. Thus the assignment statement: 

B((i+1) mod N, j) ← B(i, j) 

implies that each processor transmits its B register value to 
the B register of the processor on its right. 


5. In a unit route, data may be transmitted from one processor 
to another only if the two are directly connected. We assume 
that the links in the interconnection network are unidirec- 
tional. Hence at any given time, data can be transferred 
either from PE((i+1) mod N, j) to PE(i, j) or vice versa. 


6. Since the asymptotic complexity of all our algorithms is 
determined by the number of unit routes, our complexity 
analysis will count only these. 
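
The indexing conventions in the list above can be summarized in a few lines of Python. This is an illustrative sketch only: the function names `row_major`, `neighbors` and `masked_increment` are assumptions, and a register is modelled as a P × P list of lists.

```python
def row_major(i: int, j: int, P: int) -> int:
    """One dimensional index of PE(i, j) under row major mapping."""
    return i * P + j


def neighbors(i: int, j: int, P: int) -> list:
    """The four PEs connected to PE(i, j) on a P x P mesh with wraparound."""
    return [((i - 1) % P, j), ((i + 1) % P, j),
            (i, (j - 1) % P), (i, (j + 1) % P)]


def masked_increment(A: list, mask) -> list:
    """A(i, j) := A(i, j) + 1, executed only by PEs for which mask(i, j) holds."""
    P = len(A)
    return [[A[i][j] + 1 if mask(i, j) else A[i][j] for j in range(P)]
            for i in range(P)]


# Rows 0 and 4 of an 8 x 8 mesh satisfy (i mod 4 = 0).
A = masked_increment([[0] * 8 for _ in range(8)], lambda i, j: i % 4 == 0)
print(sum(map(sum, A)))  # prints 16
```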


2.2. Basic Data Manipulation Operations 


In this section, we develop algorithms for some basic opera- 
tions on a one dimensional array. Such an array with P PEs has 
the topology shown in Figure 3. The PEs are indexed 0 through P-1 
left to right. 


Figure 3: One dimensional array 


2.2.1. Shift 


SHIFT(A, i) shifts the A register data circularly counter 
clockwise by i. It can be performed in |i| unit routes unless P = 2. 
In this case, a shift of +1 requires 2 unit routes because of the 
assumption of unidirectional links. For convenience, we shall hen- 
ceforth assume P > 2. 


2.2.2. Data Accumulation 


For this operation, PE j has an array A[0..M-1] of size M. 
The notation A[i](j) refers to A[i] in PE j. In addition, each PE 
has a value in its I register. After the data accumulation, the M 
elements of A in PE j are such that: 

A[i](j) = I((j + i) mod P), 0 ≤ i < M, 0 ≤ j < P 

This operation can be performed in (M-1) unit routes using 
the algorithm given in Figure 4. 

procedure ACCUM(A, I, M) 
{each PE accumulates in A, the I values of the next M PEs 
including itself} 
begin 
   A[0] := I; 
   for i := 1 to M-1 do 
   begin 
      SHIFT(I, -1); 
      A[i] := I; 
   end 
end {ACCUM} 


Figure 4: Data accumulation 


2.2.3. Adjacent Sum 

For each PE p, 0 ≤ p < P, the sum 

$$T(p) = \sum_{i=0}^{M-1} A[i]((p + i) \bmod P) \qquad (1)$$

is to be computed. M < P is a parameter to the operation. This 
can be performed in 2(M-1) unit routes using the algorithm of Fig- 
ure 5. The strategy here is that each processor initiates a T value 
that circulates through the M processors containing its A terms (cf. 
eq (1)). Once the M terms have been accumulated, the T values 
need to move back to the originating PEs. This requires a clock- 
wise shift of M-1. 

procedure AdjacentSum(A, M) 
begin 
   T := A[0]; 
   for i := 1 to M-1 do 
   begin 
      SHIFT(T, 1); 
      T := T + A[i]; 
   end 
   SHIFT(T, -M+1); 
end {of AdjacentSum} 


Figure 5: Adjacent Sum 
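
The three data movement operations SHIFT, ACCUM and AdjacentSum can be simulated sequentially to check the index arithmetic. The sketch below is illustrative only: a register is modelled as a Python list with one entry per PE, A as a list of M such registers, and the lowercase function names are assumptions made here.

```python
def shift(reg: list, i: int) -> list:
    """SHIFT(reg, i): the value held by PE j moves to PE (j + i) mod P."""
    P = len(reg)
    out = [None] * P
    for j in range(P):
        out[(j + i) % P] = reg[j]
    return out


def accum(I: list, M: int) -> list:
    """ACCUM: returns A with A[i][j] == I[(j + i) % P], using M - 1 unit shifts."""
    reg = list(I)
    A = [list(reg)]
    for _ in range(1, M):
        reg = shift(reg, -1)       # each PE now sees the value one further away
        A.append(list(reg))
    return A


def adjacent_sum(A: list, M: int) -> list:
    """AdjacentSum: returns T with T[p] == sum of A[i][(p + i) % P], 0 <= i < M."""
    T = list(A[0])
    for i in range(1, M):
        T = shift(T, 1)            # partial sums travel to the next PE
        T = [T[p] + A[i][p] for p in range(len(T))]
    return shift(T, -(M - 1))      # send the finished sums back to their origin


I = [10, 11, 12, 13, 14, 15]
A = accum(I, 3)
print(A[2])  # prints [12, 13, 14, 15, 10, 11]
```

In this model `accum` issues M - 1 unit shifts and `adjacent_sum` issues (M - 1) unit shifts plus one shift of length M - 1, matching the unit route counts stated in the text.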


3. ONE DIMENSIONAL CONVOLUTION 


The inputs to the one dimensional convolution problem are 
vectors I[0..N-1] and T[0..M-1]. The output is the vector C1D 
where: 

$$C1D[i] = \sum_{v=0}^{M-1} I[(i + v) \bmod N] * T[v], \quad 0 \le i < N$$


We use the computation of C1D as a basic step in our algo- 
rithms to compute C2D. In this section, we develop algorithms for 
C1D on a one dimensional processor array. We consider the two 
cases: 


(i) Each PE has O(M) memory 
(ii) Each PE has O(1) memory 


Our algorithms assume that the controller cannot broadcast 
data values to the processors. There are P = N processors and the 
vector I is mapped onto the one dimensional processor array such 
that processor p contains I[p]. Further, there are N/M copies of 
T in the processor array with one copy in each block of M proces- 
sors (processors iM + j, 0 ≤ j < M form a block for each i, 
0 ≤ i < N/M). Within a block, the mapping of T is the same as 
that of I. The case when N = 16 and M = 4 is given in Figure 6. The 
first row of this figure gives the processor index. 


Figure 6: Initial configuration for 1-D convolution 


3.1. O(M) Memory 


When each processor has O(M) memory, the most effective 
way to compute C1D is to first perform a data accumulation on I 
(Figure 4). Following this, each processor has all the I values 
needed to compute the corresponding entry of C1D. Next, the T 
values are circulated through the M processors. During this circula- 
tion, the T values are multiplied by I values and the C1D values 
computed. It is worth noticing that, unlike the algorithms of 
[LEE87] and [MARE86], no results are moved in our algorithms. 
Thus when the size of the results is much larger than the size of 
the template or image values, our algorithm will perform better. 
The algorithm is presented in Figure 7. The total number of unit 
routes is 2M-1. Note that while the last iteration of SHIFT(T, -1) 
is unnecessary for the computation of C1D, it restores the original 
T values. This is required by our C2D algorithm. 


procedure C1D_M(M) 
{O(M) memory algorithm for one dimensional convolution} 
begin 
   ACCUM(A, I, M); 
   C1D := 0; 
   for j := 0 to M-1 do 
   begin 
      C1D := C1D + A[(j + p) mod M] * T; 
      SHIFT(T, -1); 
   end 
end; {of C1D_M} 


Figure 7: O(M) memory computation of C1D 


3.2. O(1) Memory 


When only O(1) memory per PE is available, we begin by 
first pairing I values in the processors. The pair in processor p is 
(A(p), B(p)) = (I[(q + 2k) mod N], I[(q + 2k + 1) mod N]) where 
q = ⌊p/M⌋M and k = p mod M. Figure 8 gives the initial AB 
pairs in each PE for the case N = 16, M = 4. This pairing is 
easily obtained in M unit routes. Once the AB pairing has been 
obtained, the C1D may be computed by rotating the AB and T 
values clockwise. Throughout the algorithm, the product of A(p) 
and T(p) will give one of the terms needed to compute 
C1D(p), 0 ≤ p < P. B(p) will be the next I value needed. Initially, 
this is true for all processors except those with p mod M = M-1. 
This situation is remedied by replacing B with I in these processors 
to get the first column labeled AB'. Following a rotation of AB, we 
get the second column labeled AB. Now, the B value in processors 
with p mod M = M-2 needs to be changed to I(p). With this 
insight, one arrives at the algorithm of Figure 9. Its correctness is 
easily established. The number of unit routes (including those for 
pairing) is at most 3M. 
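
The O(M) memory procedure C1D_M can be simulated sequentially and compared against the definition of C1D. The sketch below is illustrative only: registers are modelled as Python lists with one entry per PE, and the names `shift`, `c1d_direct` and `c1d_m` are assumptions made here.

```python
def shift(reg: list, i: int) -> list:
    """SHIFT(reg, i): the value held by PE j moves to PE (j + i) mod P."""
    P = len(reg)
    out = [None] * P
    for j in range(P):
        out[(j + i) % P] = reg[j]
    return out


def c1d_direct(I: list, T: list) -> list:
    """C1D[i] = sum over v of I[(i + v) mod N] * T[v], straight from the definition."""
    N, M = len(I), len(T)
    return [sum(I[(i + v) % N] * T[v] for v in range(M)) for i in range(N)]


def c1d_m(I: list, T: list) -> list:
    """Simulation of C1D_M on N PEs; N must be a multiple of M."""
    N, M = len(I), len(T)
    assert N % M == 0
    # ACCUM: A[i][p] == I[(p + i) % N]
    reg = list(I)
    A = [list(reg)]
    for _ in range(1, M):
        reg = shift(reg, -1)
        A.append(list(reg))
    # One copy of T per block of M PEs: PE p starts with T[p mod M].
    t_reg = [T[p % M] for p in range(N)]
    C = [0] * N
    for j in range(M):
        for p in range(N):
            # PE p currently holds T[(p + j) mod M]; pair it with I[p + (p + j) mod M].
            C[p] += A[(j + p) % M][p] * t_reg[p]
        t_reg = shift(t_reg, -1)
    return C


I = list(range(1, 17))
T = [1, 2, 3, 4]
print(c1d_m(I, T) == c1d_direct(I, T))  # prints True
```

Only the I and T values move between PEs in this simulation; each C1D entry is accumulated in place, which is the property the text contrasts with result-moving algorithms.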


4. TWO DIMENSIONAL CONVOLUTION 


We assume that C2D is to be computed on an N × N mesh 
connected SIMD multicomputer. Further, we assume that I[i, j] is 
initially in the I register of PE(i, j), the result C2D is to be com- 
puted such that C2D[i, j] is in the C2D register of PE(i, j), and 
that N is a multiple of M. Thus the N × N array may be viewed 
as composed of N²/M² arrays each of size M × M (Figure 10). We 
also assume that processor PE(i, j) contains 
T[i mod M, j mod M] in its T register. 


4.1. O(M) Memory 


In this case, PE(i, j), 0 ≤ i < N, 0 ≤ j < N, first computes M one 
dimensional convolutions S[q], 0 ≤ q < M, as defined below: 


$$S[q] = \sum_{r=0}^{M-1} I[i, (j + r) \bmod N] \cdot T[q, r]$$ 


Next, C2D is computed by performing an adjacent sum operation 
along the columns of the N × N PE array. A high level 
description of the algorithm is given in Figure 11. The total 
number of unit routes is M² + O(M). Notice that result movement 
is restricted to the adjacent sum, which takes O(M) time. The 
algorithms of [LEE87] and [MARE87] require O(M²) result movement. 
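
The decomposition into row convolutions followed by a column sum can be checked numerically. The sketch below is serial and only illustrates the arithmetic, not the mesh routing; it assumes the wraparound definition C2D[i, j] = sum over r and a of I[(i + r) mod N, (j + a) mod N] * T[r, a] used by the paper for the O(1) memory case. The function names are hypothetical.

```python
def c2d_direct(I, T):
    """C2D straight from the double-sum definition (wraparound indices)."""
    N, M = len(I), len(T)
    return [[sum(I[(i + r) % N][(j + a) % N] * T[r][a]
                 for r in range(M) for a in range(M))
             for j in range(N)] for i in range(N)]


def c2d_row_then_column(I, T):
    """Compute the M row convolutions S[q] in every PE, then sum along columns."""
    N, M = len(I), len(T)
    # S[q][i][j] = sum_r I[i][(j + r) mod N] * T[q][r]
    S = [[[sum(I[i][(j + r) % N] * T[q][r] for r in range(M))
           for j in range(N)] for i in range(N)] for q in range(M)]
    # Adjacent sum along columns: C2D[i][j] = sum_q S[q][(i + q) mod N][j]
    return [[sum(S[q][(i + q) % N][j] for q in range(M))
             for j in range(N)] for i in range(N)]
```

Both functions return the same N × N result, which is the identity that Step 3 of Figure 11 depends on.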


4.2. O(1) Memory 


We develop two algorithms for this case. The first is conceptually 
simpler but requires 4M² + O(M) unit routes. The second 
requires only 2M² + O(M) unit routes. For the first algorithm, we 
rewrite the definition of C2D as 


$$C2D[i, j] = \sum_{r=0}^{M-1} CXD[i, r, j]$$ 
Figure 8: Execution Trace, N = 16, M = 4 (table with columns i, I, AB, T, AB'; entries not recoverable) 
procedure C1D_1(M) 
{O(1) memory C1D algorithm} 
begin 
  PAIRING(M); {obtain initial AB pairs} 
  for j := 0 to M-1 do 
  begin 
    B(p) := I(p); {p mod M = M-1-j} 
    C1D := C1D + A * T; 
    SHIFT(A, -1); 
    C := B; B := A; A := C; {Exchange A and B} 
    SHIFT(T, -1); 
  end; 
end; {of C1D_1} 


Figure 9: O(1) memory computation of C1D 


procedure C2D_M(N, M) 
{assumes O(M) memory per PE} 
Step1: Regard the N × N mesh as N one dimensional array 
       processors. Each row forms one such array. Perform a 
       data accumulation on I. Now each PE contains the M I 
       values it needs to compute its S[q]'s. 
Step2: Compute the S[q]'s. Each S[q] is a one dimensional 
       convolution. However, the data accumulation step of the 
       algorithm of Figure 7 may be omitted as the I values 
       have already been accumulated in Step 1. To go from 
       one S to another, the T values need to be shifted along 
       the columns of each M × M block. This can be done 
       using the vertical inter PE connections. 
Step3: Compute $C2D(i, j) = \sum_{r=0}^{M-1} S[r]((i + r) \bmod N, j)$. This is 
       done using the adjacent sum algorithm of Section 2 on 
       the columns of the N × N PE array. 
end 


Figure 11: High level description of two dimensional convolution 
with each PE having O(M) Memory 


procedure C2D(N, M) 
{O(1) memory C2D algorithm} 
Step1: Repeat Steps 2, 3 and 4 for q := 0 to M-1 
Step2: PE(i + r, j) computes C1D(i + r, j) = CXD[i, r, j], 
       where i mod M = q and 0 ≤ r < M. This is done using 
       procedure C1D_1 of Figure 9. 
Step3: PE(i, j) for i mod M = q computes 
       $C2D(i, j) = \sum_{r=0}^{M-1} C1D((i + r) \bmod N, j)$ by repeatedly 
       shifting the C1D values up the columns of the processor 
       array. 
Step4: T(i, j) ← T((i + 1) mod N, j) 
end; {of C2D} 


Figure 12: O(1) memory computation of C2D 


Figure 10: An N × N array viewed as N²/M² M × M arrays 
where 

$$CXD[i, r, j] = \sum_{a=0}^{M-1} I[(i + r) \bmod N, (j + a) \bmod N] \cdot T[r, a]$$ 

A high level description is provided in Figure 12. The number 
of unit routes is 4M² + O(M). In the q'th iteration of Steps 2 and 
3 of this algorithm, C2D(i, j) is computed for all PEs with 
i mod M = q. This is done by first having PE(i + r, j) compute 
C1D(i + r, j) = CXD[i, r, j] for i mod M = q and 0 ≤ r < M. 
Next, C2D(i, j) is computed by summing CXD[i, r, j] over 
0 ≤ r < M for i mod M = q. This summation is done only in PEs 
(i, j) with i mod M = q and requires shifting the CXD[i, r, j] 
values up along the columns. Each iteration of Steps 2 and 3 takes 
4M + O(1) unit routes. Thus the unit route count for the entire 
algorithm is 4M² + O(M). Note also that this algorithm requires 
M² − M movements of the C1D values. 


The strategy for the second algorithm is similar to that used 
in computing C1D when only O(1) memory is available. We may 
rewrite the definition of C2D as 


$$C2D[i, j] = \sum_{r=0}^{M-1} X[i, j, r] \cdot Y[r]$$ 


where X[i, j, r] is the 1 × M vector 
I[(i + r) mod N, j .. (j + M − 1) mod N] and Y[r] is the 1 × M vector 
T[r, 0 .. M − 1]. Thus C2D is viewed as the one dimensional 
convolution of X and Y, where each term of X and Y is a vector and 
each product is an inner product. The algorithm is presented in 
Figure 13. Figure 14 shows the initial AB, I1 I2, A1 A2, and B1 B2 
pairs created by the first four PAIRING steps when N = 8 and M = 4. 
The total number of unit routes is 2M² + O(M). Unlike the first 
algorithm, there is no result movement in this algorithm. 


procedure C2D_1(N, M) 
{assumes O(1) memory per PE} 
begin 
  PAIRING(M) along rows of I; {obtain AB pairs} 
  PAIRING(M) along columns of I; {obtain I1 I2 pairs} 
  PAIRING(M) along columns of A; {obtain A1 A2 pairs} 
  PAIRING(M) along columns of B; {obtain B1 B2 pairs} 
  C2D := 0; 
  for a := 0 to M-1 do 
  begin 
    A2(i, j) := A(i, j); {i mod M = M-1-a} 
    B2(i, j) := B(i, j); {i mod M = M-1-a} 
    I2(i, j) := I(i, j); {i mod M = M-1-a} 
    AA := A1; BB := B1; 
    for b := 0 to M-1 do 
    begin 
      BB(i, j) := I1(i, j); {j mod M = M-1-b} 
      C2D(i, j) := C2D(i, j) + AA(i, j) * T(i, j); 
      SHIFT(AA, -1) along rows; 
      C := BB; BB := AA; AA := C; 
      SHIFT(T, -1) along rows; 
    end 
    SHIFT(A1, -1) along columns; 
    SHIFT(B1, -1) along columns; 
    SHIFT(I1, -1) along columns; 
    C := A1; A1 := A2; A2 := C; 
    C := B1; B1 := B2; B2 := C; 
    C := I1; I1 := I2; I2 := C; 
    SHIFT(T, -1) along columns; 
  end; 
end {of C2D_1} 


Figure 13: Two dimensional convolution 
with each PE having O(1) Memory 


5. CONCLUSION 


In this paper, we have developed optimal algorithms for 1-D 
convolution and image template matching (2-D convolution) on a 
mesh connected SIMD multicomputer. None of our algorithms 
requires a data broadcast. Further, our algorithms require less 
movement of results, or none at all. Hence, these algorithms will be 
more efficient when the image and template values are small 
compared to the convolution values. 
Abstract 


This paper describes several parallel algorithms that solve geometric 
problems. The algorithms are based on a vector model of computation 
— the scan-model. The purpose of this paper is both to illustrate how 
the model can be used and to describe a set of simple algorithms. 

We describe a k-D tree algorithm that, for n points, requires O(lg n) 
calls to the primitives, a line-drawing algorithm that requires O(1) calls 
to the primitives, a line-of-sight algorithm that requires O(1) calls to the 
primitives, and finally two convex-hull algorithms. All these algorithms 
should be noted for their simplicity rather than complexity; many of 
them are parallel versions of known serial algorithms or variants of 
known parallel algorithms. 

Most of the algorithms discussed in this paper have been implemented 
on the Connection Machine, a highly parallel single instruction multiple 
data (SIMD) computer. 


1 Introduction 


The purpose of this paper is twofold. Firstly, it describes a set of ele- 
gant, practical algorithms for solving a diverse set of problems in com- 
putational geometry and graphics. Secondly, it helps demonstrate that 
the scan-model is a viable model of computation. These two purposes 
complement each other: the model allows a simple description of the 
algorithms, and the algorithms demonstrate the power of the model. 

Researchers have suggested several synchronous parallel models of 
computation. The most popular of these models are the parallel ran- 
dom access machine (P-RAM) models [13]. A P-RAM consists of a set 
of conventional processors attached to a single shared memory. Proces- 
sors communicate through the shared memory: one processor can write 
a value into the memory and another processor can read this value. 
Researchers have suggested several variations of the P-RAM models. 
These variations mostly differ in whether or not they permit concur- 
rent reads from, or concurrent writes to, a unique memory location. 
By assuming that memory references take unit-time, the P-RAM mod- 
els have been used to determine the asymptotic running time of many 
parallel algorithms. 

We suggest another class of synchronous parallel models of compu- 
tation defined in terms of a set of primitive operations that operate on 
arbitrarily long vectors of atomic values. We call these models vector 
models [8]. The models differ from P-RAM models both in that they 
are single instruction multiple data (SIMD) models, and in that there 
is no concept of a memory shared among many processors. Elements 
in a vector communicate through a permutation primitive rather than 
a shared memory. As with the P-RAM models, vector models can be 
used to analyze the asymptotic running time of algorithms, by making 
assumptions about the relative running times of the primitives. 

Since vector models are SIMD, they can be efficiently mapped onto 
a wider range of architectures than P-RAM models can. As well as 
being implementable on standard serial computers and on multiple in- 
struction parallel computers, they can be efficiently implemented on 
vector processors, such as the vector processor of the CRAY systems 
[24], or single instruction parallel computers, such as the Connection 
Machine [16]. On the other hand, since P-RAM models are multiple 
instruction multiple data (MIMD) models, they are more powerful than 
vector models. As should become evident in this paper, and as argued 
elsewhere [8], this additional power is not necessary for a broad range 
of practical algorithms. We also believe that vector models tend to lead 
to simpler and more concrete algorithm descriptions than do P-RAM 
models. 

* This report describes research done within the Artificial Intelligence Laboratory 
at the Massachusetts Institute of Technology. Support for the A.I. Laboratory's 
artificial intelligence research is provided in part by the Advanced Research Projects 
Agency of the Department of Defense under Army contract DACA76-85-C-0010 and 
in part under the Office of Naval Research contract N00014-85-K-0124. 

The scan-model is a particular vector model. The name comes from 
the inclusion of a scan, also known as prefix, primitive. In this paper 
we describe several algorithms based on the scan-model. Most of the 
algorithms we describe in this paper have been implemented on the 
Connection Machine. All the algorithms described in this paper are 
described in more detail in [10,8]. Before describing the algorithms, 
we define the scan-model and introduce some techniques based on the 
model. 


2 The Scan-Model 


The scan-model [8] is defined in terms of a set of primitive operations 
that operate on arbitrarily long vectors of atomic values. By a vector 
we mean a one dimensional array (an ordered set). By atomic values 
we mean values that can be represented in O(lg n) bits; in this paper 
we only use integers, floating point numbers and boolean values. We 
assume that the primitives take approximately the same amount of 
time when operating on vectors of equal length. The scan-model 
has three classes of primitives: elementwise arithmetic and logical 
operations, permutation operations, and scan operations, a type of 
parallel prefix computation. 

Each elementwise primitive operates on equal length vectors, producing 
a result vector of equal length. Element i of the result is an 
elementary arithmetic or logical primitive (such as +, −, ×, or, and not) 
applied to element i of each of the input vectors. For example: 


A     = [ 5 1 3  4 3  9  2  6] 
B     = [ 2 5 3  8 1  3  6  2] 
A + B = [ 7 6 6 12 4 12  8  8] 
A × B = [10 5 9 32 3 27 12 12] 


The permutation primitive takes two vector arguments, a data vector 
and an index vector, and permutes each element in the data vector 
to the location specified in the index vector. For example: 


Vector Index     = [0 1 2 3 4 5 6 7] 
A (data vector)  = [o t e m e r g y] 
I (index vector) = [2 5 4 3 1 6 0 7] 
permute(A, I)    = [g e o m e t r y] 


It is an error for more than one element to have the same index 
— the permutation must be one-to-one. This restriction is similar to 
the restriction made in the exclusive read exclusive write (EREW) P- 
RAM model. To allow communication between vectors of different sizes, 
we include a version of the permute primitive that returns a vector of 
different length than the source vectors by masking out elements or 
putting in defaults. 
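
As an illustration, the equal-length permute primitive can be sketched serially in Python. The function name and the choice to raise `ValueError` on a non one-to-one index vector are assumptions made for this sketch; the variant that changes vector length is not covered.

```python
def permute(data, index):
    """Send data[i] to position index[i]; the index vector must be one-to-one."""
    if sorted(index) != list(range(len(data))):
        raise ValueError("index vector must be a one-to-one permutation")
    result = [None] * len(data)
    for value, target in zip(data, index):
        result[target] = value
    return result
```

Calling `permute(list("otemergy"), [2, 5, 4, 3, 1, 6, 0, 7])` reproduces the example above and yields the letters of "geometry".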

The scan primitives execute a scan operation, sometimes called a 
prefix computation, on a vector. The scan operation takes a binary 
associative operator ⊕ and a vector [a_0, a_1, ..., a_{n-1}] of n elements, and 
returns the vector [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1})]. In this paper we 
will only use plus, maximum, minimum, or and and as operators for the 
scan primitives. We will henceforth call these scan operations +-scan, 
max-scan, min-scan, or-scan and and-scan. Some examples: 


A           = [5 1 3 4 3 9 2 6] 
+-scan(A)   = [5 6 9 13 16 25 27 33] 
max-scan(A) = [5 5 5 5 5 9 9 9] 


A             = [5 1] [3 4 3 9] [2 6] 
B             = [1 0] [2 0 3 1] [0 1] 
+-scan(A)     = [5 6] [3 7 10 19] [2 8] 
permute(A, B) = [1 5] [4 9 3 3] [2 6] 

Figure 1: Examples of the segmented versions of the primitive operations. 


In [9,8] we argue why, in the analysis of algorithms, the scan primi- 
tives should be considered no more expensive (timewise) than the per- 
mutation primitive. The basic argument is that the scan primitives can 
be implemented, both in practice and in theory, to run as fast as the 
permutation primitive. 
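
The unsegmented scan examples can be reproduced with a short serial sketch. It relies on `itertools.accumulate` from the Python standard library, which returns running totals that include the current element, matching the definition given above. The wrapper names are hypothetical, and a serial loop says nothing about the parallel cost argued for in [9,8].

```python
from itertools import accumulate
import operator


def scan(op, vector):
    """Scan as defined above: [a0, a0 op a1, ..., a0 op ... op a(n-1)]."""
    return list(accumulate(vector, op))


def plus_scan(vector):
    return scan(operator.add, vector)


def max_scan(vector):
    return scan(max, vector)


def or_scan(vector):
    return scan(operator.or_, vector)
```

For A = [5 1 3 4 3 9 2 6], `plus_scan` and `max_scan` give the two result vectors shown in the examples.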

In the description of algorithms we will often loosely refer to vectors 
in which each element contains some fixed number of atomic values. At 
the primitive level such a structure vector would be implemented with 
multiple vectors but a higher level language could support record-like 
vectors in which each element has some constant number of values. 


2.1 Segments 


This section describes a method that allows a programmer to take a 
vector routine defined to operate on a single set of data and then apply 
it to many sets in parallel. For example, if we had a vector routine 
that sorted a set of values, we could apply it to sort many sets of 
data in parallel. Or, if we had a vector routine that, given endpoints, 
determines the pixels on a line, we could apply it to draw many lines 
in parallel. 

The technique involves dividing a vector into segments and placing 
one set of data in each segment. To keep track of how a data vector 
is segmented, we associate with the data vector a segment-descriptor. 
A segment-descriptor is itself a vector which has as many elements as 
segments of the data vector; each of these elements contains an integer 
which specifies the length of the segment¹. For example: 


A                  = [5 1 3 4 3 9 2 6] 
segment-descriptor = [2 4 2] 
A                  = [5 1] [3 4 3 9] [2 6] 

Henceforth, the notation 

A = [5 1] [3 4 3 9] [2 6] 


is shorthand for a pair of vectors: the data vector along with its segment- 
descriptor. 


For each primitive of the scan-model we define a segmented version 
that operates independently within each segment. Figure 1 illustrates 
examples of segmented versions of the primitives. The segmented 
version of the permutation primitive bases its indices relative to the 
beginning of each segment so values permute within a segment; it is 
an error for an index to reference outside of the segment. The 
segmented versions of the scan primitives restart at the beginning of each 
segment². The segmented versions of the elementwise operations are 
unchanged. All the segmented versions can be simulated with a small 
constant number of calls to the unsegmented versions [8]. 
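
A minimal serial sketch of the segmented +-scan and segmented permute follows, using a flat data vector plus a segment-descriptor of lengths. It loops over segments for clarity, so it is not the constant-number-of-unsegmented-calls simulation that the text refers to; the helper names are hypothetical.

```python
from itertools import accumulate


def split_segments(data, descriptor):
    """Cut a flat vector into pieces whose lengths are given by the descriptor."""
    pieces, start = [], 0
    for length in descriptor:
        pieces.append(data[start:start + length])
        start += length
    return pieces


def seg_plus_scan(data, descriptor):
    """+-scan that restarts at the beginning of every segment."""
    result = []
    for segment in split_segments(data, descriptor):
        result.extend(accumulate(segment))
    return result


def seg_permute(data, index, descriptor):
    """Permute within each segment; indices are relative to the segment start."""
    result = []
    for segment, idx in zip(split_segments(data, descriptor),
                            split_segments(index, descriptor)):
        if sorted(idx) != list(range(len(segment))):
            raise ValueError("indices must be one-to-one within the segment")
        out = [None] * len(segment)
        for value, target in zip(segment, idx):
            out[target] = value
        result.extend(out)
    return result
```

With A = [5 1 3 4 3 9 2 6], descriptor [2 4 2] and B = [1 0 2 0 3 1 0 1], these reproduce the rows of Figure 1.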


The Segment Lemma: With a segmented version of all the prim- 
itives of the scan-model, we can apply any routine defined in terms 
of those primitives to operate on a single set of data, to multiple 
sets of data independently and in parallel. 


We won’t prove this lemma in this paper, but it should be intuitive; 
a proof is given in [8]. This lemma allows great simplification of the 
code needed to describe parallel algorithms. 

¹ There are several other ways of representing segments [8] but we find this 
representation the most convenient. 
² A similar operation was suggested by Schwartz [26]. 
A = [7 3 8] 
L = [2 4 2] 
I = [0 2 1] 
B = [5 1] [3 4 3 9] [2 6] 
C = [1 0] [2 1 3 0] [0 1] 
F = [T F] [T F F T] [T T] 

distribute(A, L)   = [7 7] [3 3 3 3] [8 8] 
index(L)           = [0 1] [0 1 2 3] [0 1] 
element(B, I)      = [5 3 6] 
+-reduce(B)        = [6 19 8] 
max-reduce(B)      = [5 9 6] 
pack(B, F)         = [5] [3 9] [2 6] 
split(B, F)        = [1] [5] [4 3] [3 9] [] [2 6] 
delete-split(B, F) = [1] [5] [4 3] [3 9] [2 6] 
rank-split(C, F)   = [0] [0] [0 1] [1 0] [] [0 1] 


Figure 2: Examples of a set of simple operations based on the primitives. 


2.2 Some Simple Operations 


In this section we describe several useful, simple operations that can 
be implemented with a small constant number of calls to the primitive 
operations [9]. As with the segmented versions of the primitives, these 
operations are useful enough that they might themselves be considered 
primitives and be implemented directly. Many of these operations are 
similar to primitives of APL [18]. Figure 2 illustrates examples of each 
of the operations. 

The distribute operation takes a vector of values and a vector of lengths 
and distributes each value into a segment of length specified by lengths. 
The index operation takes a vector of lengths, creates a segment for 
each length, and returns the index of each element within each 
segment. The element operation takes a segmented vector values, and a 
vector of indices with one element per segment. Each index i is used 
to extract the i-th element from the corresponding segment in values. 
The reduce operation takes a segmented vector of values and combines 
all the elements in each segment using one of five binary operators: 
+, maximum, minimum, or or and. It returns a vector with as many 
elements as segments. 

The append operation takes two segmented vectors of values with the 
same number of segments and appends the two vectors segmentwise. 
The pack operation takes a segmented vector of values and a segmented 
boolean vector of flags, and packs all the elements with a T in their 
flag into consecutive elements, deleting elements with an F in their flag. 
The split operation takes a segmented vector of values and a segmented 
boolean vector of flags, and packs all the elements with an F in their 
flag to the bottom of each segment and elements with a T in their flag 
to the top of each segment. It also splits each segment in two at the 
boundary between the T and F elements. We also define a delete-split 
operation which is the same as split but deletes any empty segment. The 
rank-split operation is similar to the split operation except that the ranks 
argument must be a valid set of indices for the permutation primitive. 
As well as splitting these indices, the rank-split operation renumbers 
them so they are valid within the new segments but maintain the same 
order. 
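
The operations described in this section can be sketched serially as follows. For brevity a segmented vector is represented as a Python list of lists rather than as a data vector plus segment-descriptor, which is a simplification made for this sketch; the function names mirror the operation names but are otherwise hypothetical.

```python
from functools import reduce
import operator


def distribute(values, lengths):
    return [[v] * n for v, n in zip(values, lengths)]


def index(lengths):
    return [list(range(n)) for n in lengths]


def element(segments, indices):
    return [seg[i] for seg, i in zip(segments, indices)]


def seg_reduce(op, segments):
    return [reduce(op, seg) for seg in segments]


def pack(segments, flags):
    """Keep only the elements whose flag is True, segment by segment."""
    return [[v for v, f in zip(seg, fs) if f] for seg, fs in zip(segments, flags)]


def split(segments, flags):
    """Each segment becomes two: its F elements, then its T elements."""
    out = []
    for seg, fs in zip(segments, flags):
        out.append([v for v, f in zip(seg, fs) if not f])
        out.append([v for v, f in zip(seg, fs) if f])
    return out


def delete_split(segments, flags):
    return [seg for seg in split(segments, flags) if seg]
```

Applied to the vectors B and F of Figure 2, `split` returns six segments including one empty segment, and `delete_split` returns the same list with the empty segment removed.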


2.3 Recursive Splitting 


The segment abstraction and the primitives we described allow simple 
definitions of recursive routines that start with some set of values, split 
this set into subsets and recursively solve the problem on each subset. 
We will call this technique recursive splitting. As an example of such a 
technique, consider the following parallel version of quicksort. As with 
the serial algorithm, the algorithm picks one of the keys as a pivot value, 
splits the keys into two sets, one with greater valued keys and one with 
lesser valued keys, and then recurses on each set. 

The parallel version picks a random element from each segment as a 
pivot value using the element operation³. The algorithm distributes this 
pivot value over each segment using a distribute operation, and splits 
the keys based on whether a key is greater or less than the pivot using 
the delete-split operation⁴. The algorithm is now applied recursively to 
the result. When the numbers within every segment are in nondecreasing 
order, we return. As with the serial algorithm, this algorithm is 
expected to complete in O(lg n) steps⁵. In the scan-model, each step 
requires a small constant number of operations. 

³ I assume that there is a primitive elementwise random operation which in each 
element takes an integer and returns a pseudo-random number less than that integer. 
⁴ We use the delete-split operation instead of the split operation so that we never 
have more segments than elements. 


point             = [A B C D E F G H I J K L M N O P] 
x-rank            = [0 6 15 10 7 2 4 12 14 8 13 3 1 5 11 9] 
y-rank            = [13 7 4 3 15 6 11 0 9 8 14 1 10 2 5 12] 
above-split-line? = [F F T T F F F T T T T F F F T T] 
rank-split x-rank = [0 6 7 2 4 3 1 5] [7 2 4 6 0 5 3 1] 
rank-split y-rank = [6 3 7 2 5 0 4 1] [2 1 0 5 4 7 3 6] 


Figure 3: An example of a 2-D tree. The top diagram shows the final 
splitting. The vectors below are generated during the first step, when 
splitting along the line Lz. 

The code needed to implement quicksort in the scan-model is as fol- 
lows: 


define quicksort(keys){ 
  if-any (shift-left(keys) < keys) 
  then 
    pivots ← element(keys, random(length(keys))); 
    flags ← distribute(pivots, length(keys)) < keys; 
    quicksort(delete-split(keys, flags)); 
  else keys} 
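
The recursive-splitting structure can be imitated serially in Python. This sketch is not the scan-model code above: it keeps the segments as a list of lists, splits each segment three ways (<, =, >), which is one of the remedies for duplicate keys mentioned in the footnote, and uses a seeded `random.Random` so that runs are repeatable. The function name and the returned step count are hypothetical additions.

```python
import random


def quicksort_by_splitting(keys, seed=0):
    """Repeatedly split every segment around a random pivot until all are sorted."""
    rng = random.Random(seed)
    segments = [list(keys)] if keys else []
    steps = 0
    # Continue while any segment still has an adjacent pair out of order.
    while any(a > b for seg in segments for a, b in zip(seg, seg[1:])):
        next_segments = []
        for seg in segments:  # in the scan-model all segments split in parallel
            pivot = seg[rng.randrange(len(seg))]
            parts = ([k for k in seg if k < pivot],
                     [k for k in seg if k == pivot],
                     [k for k in seg if k > pivot])
            # Like delete-split: empty segments are dropped.
            next_segments.extend(part for part in parts if part)
        segments = next_segments
        steps += 1
    return [k for seg in segments for k in seg], steps
```

Because lesser keys always land in earlier segments, the concatenation of the segments is fully sorted as soon as every segment is internally sorted.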


This general recursive splitting technique can be used in many divide 
and conquer algorithms. In this paper we will use it in the k-D tree 
algorithm discussed in Section 3, and the quickhull algorithm discussed 
in Section 6.1. 


3 Building a k-D Tree 


A k-D tree is a technique for splitting n points in a k dimensional space 
into n regions each with a single point [6]. It starts by splitting the 
space in two along one of the coordinates using a k — 1 dimensional 
hyperplane. It then recursively splits each of the subspaces in two. 
Figure 3 illustrates an example of a 2-D tree. At each step the algorithm 
must select which dimension to split within each subspace; the criterion 
for selection depends on how the tree will be used. A common criterion 
is to select the dimension along which the spread of points is greatest. 


The k-D tree is often used as a step in other algorithms. 3-D trees 
are used in ray tracing algorithms for rendering solid objects. In such 
algorithms, objects need only be stored in the regions they penetrate 
and rays need only examine regions they cross. This can greatly reduce 
the number of objects each ray needs to examine. k-D trees are also 
used in many proximity algorithms such as the all closest pairs problem 
[15] or the closest pair problem. k-D trees have also been suggested for 
use in some machine learning algorithms [20]. 


⁵ This is actually only true if either the keys are unique, or we split into three 
groups at each step (<, =, >), or we switch between < and > on alternating steps. 
The algorithm we describe here is a parallel version of a standard 
serial algorithm [22]. For n points, our algorithm takes O(k lg n) calls 
to the primitives on vectors of length n. This algorithm is optimal in 
the sense that even if simulated on a serial machine, it will run in the 
same asymptotic running time as the best serial algorithm. 

Our algorithm consists of one step per split. Each step requires O(k) 
calls to the primitives. Before executing any steps, the algorithm sorts 
the set of points according to each of the k dimensions. The sort- 
ing can be executed with the quicksort algorithm discussed earlier, an 
enumerate-pack sorting algorithm discussed in [9], or a version of Cole’s 
sorting algorithm [11]. Instead of keeping the actual values in sorted 
order for each dimension, we keep the rank of each point along each di- 
mension. The rank of a point is the position the point would be located 
at if the vector were sorted. We call the vectors that hold these ranks, 
rank-vectors — there is one rank-vector for each dimension. Figure 3 
illustrates an example for a 2-D tree, the initial rank-vectors, and the 
result of the first step. 

At each step of the algorithm the rank-vectors will contain a segment 
for each subspace, and the ranks within each segment will be the correct 
ranks for that subspace. It suffices to demonstrate that we can execute 
a split along any dimension and generate new ranks within the two 
subspaces. The algorithm is then correct by induction. 

To split along a given dimension the algorithm distributes the cut line 
and determines for each point whether it is above or below the line⁶. 
The algorithm now uses the rank-split operation defined in Section 2.2 
to split each rank-vector based on whether a point is below or above 
the split line. The rank-split operation as defined correctly generates 
the rank within each subspace. Each step therefore requires O(k) calls 
to the primitives: some operations to determine whether each point 
is below or above the split, and k rank-split operations. Since there 
are O(lg n) steps, the whole algorithm requires O(k lg n) calls to the 
primitives. 
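
One split step on the rank-vectors can be sketched as below. The sketch renumbers ranks with a serial sort inside each new segment, which is an assumption made for brevity and not how the scan-model implements rank-split; the sample rank-vectors are the 16-point values shown in Figure 3, with the cut assumed to fall at the median x-rank. Segmented vectors are again lists of lists.

```python
def rank_split(ranks, flags):
    """Split each segment by flag and renumber ranks within each new segment."""
    out = []
    for seg, fs in zip(ranks, flags):
        for wanted in (False, True):  # F elements first, then T elements
            part = [r for r, f in zip(seg, fs) if f == wanted]
            # New rank = position among the ranks that ended up in this part.
            new_rank = {r: k for k, r in enumerate(sorted(part))}
            out.append([new_rank[r] for r in part])
    return out


x_rank = [[0, 6, 15, 10, 7, 2, 4, 12, 14, 8, 13, 3, 1, 5, 11, 9]]
y_rank = [[13, 7, 4, 3, 15, 6, 11, 0, 9, 8, 14, 1, 10, 2, 5, 12]]
# Assumed cut: points whose x-rank is in the upper half are "above" the line.
above = [[r >= 8 for r in x_rank[0]]]

new_x = rank_split(x_rank, above)
new_y = rank_split(y_rank, above)
```

Each of the k rank-vectors is split with the same flags, so the ranks in `new_x` and `new_y` are again valid permutation indices within the two subspaces.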

A two dimensional closest-pair algorithm can be implemented based 
on the k-D tree algorithm. This algorithm is a parallel version of an 
algorithm of Bentley and Shamos [7] and is described in [10]. For n 
points in a two dimensional space, our algorithm requires O(lg n) calls 
to the primitives using vectors of length O(n). 


4 Line Drawing 


Two dimensional line drawing is the problem of: given a pair of points 
on a two dimensional grid (the two endpoints of a line), determine what 
pixels in a finite resolution grid lie on a line between the endpoints. 
Line drawing is used extensively in practice in generating computer 
images, especially in computer aided design. In this section we describe 
a very simple line drawing routine. It generates the same set of pixels 
as generated by the simple digital differential analyzer (DDA) serial 
technique [19]. The routine takes a small constant number of calls to 
the primitives on vectors at most as long as the number of pixels in the 
output. Because of the segment lemma (Section 2.1), the routine can 
be used to draw many lines in parallel. The routine we describe has 
been extended by Salem [25] to render solid objects. 

The basic idea of the routine is to calculate the number of pixels in a 
line and allocate a set of vectors of that length with the line information 
distributed across the vectors. Then based on the line information and 
a unique index for each element, the elements can calculate their final 
position on the grid. The number of pixels in a line is one more than 
the maximum of the x and y differences; we will call this number L. 
We distribute one endpoint and the slope of the line across vectors of 
length L using the distribute operation and generate the unique index 
for each position in the vector with the index operation. Based on the 
index, the endpoint, and the slope, we can calculate the position of each 
pixel using some simple arithmetic. This is described in more detail in 
[10]. 
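
A serial sketch of the idea for a single line follows. Every pixel position depends only on its own index, the start point and the per-step increments, which is what lets all L elements be computed at once. The rounding rule (add 0.5 and take the floor) is an assumption made for this sketch; the exact arithmetic used in [10] is not given here, and the function name is hypothetical.

```python
import math


def draw_line(p0, p1):
    """Pixels of the line from p0 to p1, one pixel per index 0 .. L-1."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = max(abs(dx), abs(dy)) + 1  # the number called L in the text
    if length == 1:
        return [(x0, y0)]
    step_x, step_y = dx / (length - 1), dy / (length - 1)
    # Each pixel is a function of its index only (no loop-carried state).
    return [(math.floor(x0 + i * step_x + 0.5),
             math.floor(y0 + i * step_y + 0.5))
            for i in range(length)]
```

For the endpoints (0, 0) and (4, 2) the sketch produces five pixels, starting and ending exactly on the endpoints.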


5 Line of Sight 


Given a √n by √n grid of altitudes and an observation point on or 
above the surface, a line of sight algorithm finds all points on the grid 
visible from the observation point. Figure 4 illustrates an example. A 
line of sight algorithm can be applied to help determine where to locate 
potential eyesores. For example, when designing a building, a highway 
or a city dump, it is often informative to know from where the “eyesore” 
will be visible. 

⁶ As stated earlier, the method for choosing a cut line will depend on the particular 
use of the k-D tree. 


Figure 4: An example of a line of sight problem. The X marks the 
observation point. The numbers represent the altitude of each contour 
line. The elements visible from the observation point are shaded. 

The algorithm we describe requires O(1) calls to the primitives using 
vectors of length O(n). The basic idea is to allocate a segment in a 
vector for every ray that propagates in the plane from the observation 
point, henceforth referred to as X, to a boundary position. Based on 
some calculations on the points in each ray, we can determine if the 
point is visible. 

The algorithm consists of four basic steps. First, each point p in the grid 
calculates the vertical angle between the horizontal plane that passes 
through X (the observation point) and the line from p to X. Secondly, 
the algorithm allocates a set of rays, one for each boundary grid 
point, and distributes the angles from each point p in the grid to all 
the rays it belongs to. Each ray is a segment in a vector we will call 
the ray structure. Thirdly, following a ray from X to the boundary, a 
point p is visible if its angle is greater than all the angles that precede 
it in the ray. This can be determined for all points in all rays with 
a single segmented max-scan, and a comparison. Fourthly, visibility 
information is returned back to the grid points. Since a grid point can 
have a position in many rays, the visibility flags are combined using an 
or-reduce. Some permutations are required to distribute the information 
to the ray structures and to reduce it back to the grid; this is described 
in [10]. 
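
The visibility test of the third step can be sketched for a single ray. The sketch assumes grid points spaced one unit apart along the ray and measures the vertical angle with `math.atan2`; both choices, and the function name, are assumptions made for illustration. In the full algorithm the same computation would run over all rays at once with a segmented max-scan.

```python
from itertools import accumulate
import math


def visible_along_ray(observer_height, altitudes):
    """altitudes[k] is the altitude of the (k+1)-th grid point away from X."""
    angles = [math.atan2(alt - observer_height, k + 1)
              for k, alt in enumerate(altitudes)]
    running_max = list(accumulate(angles, max))
    # Largest angle strictly before each point (nothing precedes the first one).
    preceding = [-math.inf] + running_max[:-1]
    return [angle > best for angle, best in zip(angles, preceding)]
```

A point is reported visible exactly when its angle exceeds every angle that precedes it on the ray, as stated in the third step.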

The longest vectors required by the algorithm will be the vectors of 
the copy and ray structures. It is not hard to show that for a √n by √n 
grid, independent of the location of X, these vectors will have length 
2n. 


6 Convex Hull 


The planar convex hull problem is: given n points in the plane, find 
which of these points lie on the perimeter of the smallest convex region 
that contains all points. The planar convex hull problem is probably 
the most studied problem in computational geometry, both because 
it is a simple problem, making it easy to study, and because it has 
many applications — applications range from computer graphics [14] 
to statistics [17]. 

In this section we describe two scan-model based algorithms for deter- 
mining the convex hull of a set of points. The first algorithm, a parallel 
quickhull algorithm, is very simple and likely to perform well in practice 
but is not provably optimal. The second algorithm is more complicated 
and impractical but is theoretically optimal. The algorithm is based 
on a parallel algorithm designed for the concurrent read exclusive write 
(CREW) P-RAM model [1,3]. 


6.1 QuickHull 


This is a parallel version of the quickhull algorithm [22,12]. The 
quickhull algorithm was given its name because of its similarity with 
the quicksort algorithm. Like quicksort, the quickhull algorithm picks 
a pivot element, a point; splits the data based on the pivot; and is 
then recursively applied to each of the split sets. Also like quicksort, 
the pivot element is not guaranteed to split the data into sets with any 
particular ratio of sizes, so that in the worst case, the algorithm can 
require n steps. 


Figure 5: An example of the quickhull algorithm. Each vector shows one 
step of the algorithm. The line AP is the original split line. J and N are 
the farthest points in each subspace from AP, and are therefore used for 
the next level of splits. The values outside the brackets are hull points that 
have already been found. 

Figure 5 illustrates an example of the quickhull algorithm. The 
algorithm first splits the points into two sets with a line that passes between 
the two x extrema; let's call these points l and r. In the scan-model 
this is executed with a few reduce and distribute operations, some 
elementwise arithmetic calculations, and a split operation. 

The algorithm now recursively splits each of the two subspaces into 
two using the following steps. It determines for each point p in the 
subspace the perpendicular distance from the point to the line lr. This 
can be calculated with a cross product of the vectors lr and lp. The 
algorithm selects the farthest point from the line lr and distributes it 
to all other elements in the subspace; let's call this point t. It should 
be clear that t lies on the convex hull. Points within the triangle ltr 
cannot be on the convex hull and are eliminated with a pack operation. 
The point t is now used to further split each segment according to which 
of the two sides of the triangle, lt or rt, the remaining points lie beyond. 
The algorithm is now applied to the new segments recursively. The 
algorithm is completed when all segments are empty. 
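
The recursive steps above can be sketched serially. The following is a minimal Python illustration, not the scan-model implementation: the list comprehension stands in for the pack operation, `max` stands in for the reduce and distribute that select t, and the function names `cross`, `hull_side` and `quickhull` are hypothetical.

```python
def cross(l, r, p):
    """Cross product of vectors lr and lp; positive when p lies left of l->r.
    Its magnitude is proportional to the distance from p to the line lr."""
    return (r[0] - l[0]) * (p[1] - l[1]) - (r[1] - l[1]) * (p[0] - l[0])


def hull_side(points, l, r):
    """Hull points strictly left of the directed line l->r, ordered from l to r."""
    # "pack": keep only points outside the current line
    outside = [p for p in points if cross(l, r, p) > 0]
    if not outside:
        return []  # empty segment: recursion ends
    # farthest point from line lr (a reduce followed by a distribute)
    t = max(outside, key=lambda p: cross(l, r, p))
    # points inside triangle l, t, r are dropped by the two recursive packs
    return hull_side(outside, l, t) + [t] + hull_side(outside, t, r)


def quickhull(points):
    """Convex hull of a list of (x, y) tuples, starting at the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    l, r = pts[0], pts[-1]  # the two x extrema
    return [l] + hull_side(pts, l, r) + [r] + hull_side(pts, r, l)
```

Each level of the recursion corresponds to one step of the parallel algorithm, in which all segments are processed at once.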

Each step requires a small constant number of calls to the primitives. 
As with the serial quickhull, for m hull points, the algorithm runs in 
O(lg m) steps for well distributed hull points, and has a worst case 
running time of O(m) steps. 


6.2 √n Merge Hull 


This algorithm is a variation of a parallel algorithm suggested in [1] and 
independently in [3]. Their algorithm is based on the concurrent read, 
exclusive write (CREW) P-RAM model. We cannot use their algorithm 
directly because the scan-model does not permit concurrent access to 
a single value, a necessary part of their algorithm. The variation we 
describe keeps all elements that require the same data in a contiguous 
segment so the data can be distributed using a distribute operation. The 
contribution of our version is showing how the concurrent read operation 
can be replaced by the distribute operation; it involves a tree search 
method discussed in the next section. Like the original algorithm, the 
variation we describe runs with O(lg n) calls to the primitives. 
Unfortunately, we do not have space here to review the CREW 
algorithm and we refer the reader to the two papers mentioned above. 
A review of the CREW algorithm and a more complete description of our 
variation can also be found in [10]. The only difficult part of converting the 
CREW algorithm to the scan-model is in the step that finds the 
upper tangent line-segments when merging the √n subhulls into a single 
convex hull. The step uses the algorithm of Overmars [21] to find the 
upper tangent line segment for all pairs of subhulls. Overmars' method 
executes a binary search alternating between the two subhulls, and 
requires O(lg n) time. At the k-th step of the binary search, an element 
will either go down the left branch, go down the right branch, or stay 
where it is in the search tree. 

The merging step cannot be implemented directly on the scan-model 
since each pair of subhulls independently finds the upper tangent-line 
segments using the algorithm of Overmars, and will therefore require 
concurrent reads: several pairs, while executing the binary search, will 
require access to the same elements. To avoid the concurrent read, we 
place each of the sets of √n points that belong to the same subhull in 
its own segment. We then use a general binary search method described 
in [10] to execute the binary search. This search will require O(lg n) 
time. The basic idea of this binary search method is to use the split 
operation to split the elements going down each branch of the search, 
and to use the append operation to append the elements that stay at 
a vertex of the tree to the elements that come down from a parent. 
This guarantees that all elements at any vertex of the search tree are 
in a contiguous segment so that the distribute operation can be used 
to distribute information to them. Each element also needs to keep 
a pointer to the matching element in the other subhull to which it is 
trying to find an upper tangent line. 

Our variation of the CREW algorithm runs with the same number 
of calls to the primitives as the original since, as with the original, the 
sort runs in O(lg n) time, and, as argued above, the merge also runs in 
O(lg n) time. 


7 Conclusions 


This paper introduces the idea of a vector model of computation; de- 
fines a particular vector model, the scan-model; and describes several 
algorithms implemented on the scan-model. Since many of the algo- 
rithms discussed in this paper are variants of known algorithms, we 
believe that much of the contribution of this paper is to methodology 
rather than to algorithms. 

We believe that the algorithms we describe are very practical for im- 
plementation on a wide range of architectures, both serial and parallel, 
and should in most cases be almost as fast on a particular architecture as 
an algorithm designed specifically for that architecture. This generality is 
one of the main advantages of the scan-model over the P-RAM models. 
The advantage arises both because the scan-model is a vector model, 
allowing efficient implementations on vector processors and single in- 
struction parallel processors, and because it treats the scan operation 
as taking no more time than a permutation, a realistic assumption for 
almost all architectures. The algorithms described in this paper have 
been implemented on the Connection Machine. 

In more recent research we have been considering the effect of in- 
cluding other operations as primitives. The operation we have found 
most promising is a variation of the merge operation. This operation 
can be implemented efficiently on a wide range of architectures and 
is useful for many algorithms. To implement the merge operation on 
serial architectures we can use the standard merge operation, and on 
parallel architectures we can use a variation of Batcher’s bitonic merge 
network [5]. Algorithms to construct and manipulate the plane-sweep 
tree data structure [2,1,4,23] can be greatly simplified with a primitive 
merge operation. We have also found the merge primitive useful for 
manipulating sets. We have also considered sorting as a primitive, but 
we find it hard to argue that sorting should be assumed to require the 
same time as a permutation. 

We hope that the paper will help spur further interest in designing 
algorithms for vector models of computation. 
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Abstract -- This paper describes two parallel template 
matching algorithms on an SIMD array processor with a 
hypercube interconnection network. These algorithms 
improve the algorithm proposed by Fang et al[1]. The first 
algorithm proposed in this paper modifies the local address 
computation scheme, so that only one permute-multiply phase 
is needed. The computation time is reduced to half. The 
second algorithm treats window columns the same way as 
rows. Permute-multiply operations in Gray code sequence are 
also implemented among columns. For an N by N image and 
M by M window, the overall time complexity is reduced from 
O(M² + M log N) to O(M² + log N). The trade-off is that the local 
memory size of each PE is increased from M to M². 


Introduction 


Image template matching is a basic image processing 
operation. A large N by N image G is searched with an M by 
M template (window) W. The similarity measure between the 
window and the M by M subimage of G is defined as 

$$C_{ij} = \sum_{s=0}^{M-1} \sum_{t=0}^{M-1} G_{i+s,\,j+t}\, W_{st}$$

where G(i,j) is the upper-left corner element of the subimage. 
In a filtering operation, array C is the final result. The 
tremendous computation load (O(N²M²)) and the independence 
of subimages make template matching suitable for SIMD machines. 
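
As a point of reference for the parallel algorithms, the similarity measure can be computed directly on a serial machine. The sketch below is an illustration only: the function name `template_match` is hypothetical, and it assumes that only positions where the whole window fits inside the image are evaluated (the boundary treatment is not stated in the text).

```python
def template_match(G, W):
    """Direct evaluation of C[i][j] = sum over s, t of G[i+s][j+t] * W[s][t].

    G is an N x N image and W an M x M window, both as lists of lists.
    Assumption: C is computed only where the window fits entirely in G.
    """
    N, M = len(G), len(W)
    size = N - M + 1
    C = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            # G[i][j] is the upper-left corner of the subimage
            C[i][j] = sum(G[i + s][j + t] * W[s][t]
                          for s in range(M) for t in range(M))
    return C
```

Every C[i][j] depends only on its own subimage, which is the independence that the SIMD algorithms exploit.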


Several parallel template matching algorithms have been 
proposed [1-2]. Fang's algorithm [1] uses the hypercube 
interconnection network because of its popularity in commercial 
products [3-6]. The algorithm on N² PEs proposed in [1], 
with time complexity O(M² + M log N), requires that each 
PE has M local memory locations to store a window column 
and a register initialized to image element G_ij. The window 
elements are broadcast to PEs one column at a time. Within 
each image column, image elements in registers are permuted 
according to Gray code, then multiplied with the 
corresponding window element in local memory. The computation of 
the local address is designed to provide correct correspondence. 
This broadcast-permute-multiply operation is repeated 
M times, once for each window column. 


In this paper, we propose two algorithms that improve on 
the above algorithm. Section 2 describes a different scheme 
for computing the local address, so that the inter-PE 
communication time is cut in half. In Section 3, an algorithm is given 
that applies the Gray code permutation to window columns 
as well as to window rows. Since the permutation-multiplication 
is implemented in two dimensions instead of one 
dimension, the overall inter-PE communication time is 
reduced to O(M² + log N). 


The second author is supported in part by the Canadian 
Natural Sciences and Engineering Research Council under 
Grant A9198. 
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Local Address Computation 


In this section, we present an algorithm CUBE-N2-1, an 
improved version of CUBE-N2 given in [1], on a hypercube 
network with N² PEs. It is assumed that image G has been 
distributed over the N² PEs, i.e. PE(i,j) = G_ij. Each column of 
image G is divided into N/M segments, where each segment 
has M consecutive elements. The window elements are 
broadcast from control memory to local memory, one column at a 
time while the algorithm executes. Also, we assume that 
N = 2^n and M = 2^m. 

In the original algorithm CUBE-N2 given in [1], there are 
two phases (phases 2 and 3) that perform 
permute-multiply-accumulate operations. Careful investigation reveals that in 
phase 2 the PEs without the "*" mark could also perform a 
multiplication, one that is not used in generating C_ij at its own 
location but contributes to C_{i-M,j}. The algorithm CUBE-N2-1 presented in 
this section has only one phase, phase 2, to perform the 
permute-multiply-accumulate steps. Every PE is permitted to 
perform a multiplication in phase 2. The results of 
multiplications with the "*" mark are accumulated in one register, say C, 
whereas the results of multiplications in the other PEs, without the 
"*" mark, are accumulated in another register, say B. After phase 
2, the partial result in B is rotated one segment up and added 
to C to get the final result C_ij at each PE(i,j). In order for 
all PEs to perform correct multiplications, a new Lemma 3 is 
presented below for local address computation. Since the 
address may become negative, the window column is viewed 
as a circular array to obtain correct local addresses. 


Lemma 3: After executing A(p^(k)) <- A(p), MAR(p) is 
incremented by 2^k if p_k = 0, and decremented by 2^k if p_k = 1. 


The algorithm CUBE-N2-1 is given below. This 
algorithm computes C(i,j) = C_ij, 0 <= i,j <= N-1. The procedure uses 
three registers, A, B and C. It is assumed that A(i,j) is 
initialized to G_ij. B and C are used to accumulate the partial 
results during execution and are initialized to 0. The PEs are 
indexed by p or (i,j) with p = iN+j. A data movement 
example for N=32 and M=8 is given in Fig. 1. 


procedure CUBE-N2-1(A,B,C) 
begin 
1    C(p) := 0; 
2    B(p) := 0; 
3    for t := 0 to M-1 do begin 
4      MAR(p) := 0; 
5      for s := 0 to M-1 do 
6        CMAR := α + t*M + s; 
7        M[MAR(p)] <= CM[CMAR]; 
8        MAR(p) := MAR(p) + 1; 
9      end for; 
10     MAR(p) := 0; 
11     C(p) := C(p) + A(p)*M[MAR(p)]; 
12     PHASE-2; 
13     A(p^(n+m-1)) <- A(p); 
14     ROTATE(A, 0, n-1, 0); 
15   end for; 
16   ROTATE(B, n+m, 2n-1, 0); 
17   C(p) := C(p) + B(p); 
end; 



procedure PHASE-2 
begin 
1    MAR(p) := 0; 
2    for k := n to n+m-1 do begin 
3      A(p^(k)) <- A(p); 
4      MAR(p) := (MAR(p) + 2^(k-n)) mod M   (p_k = 0); 
5      MAR(p) := (MAR(p) - 2^(k-n)) mod M   (p_k = 1); 
6      C(p) := C(p) + A(p)*M[MAR(p)]   (p_k = 0); 
7      B(p) := B(p) + A(p)*M[MAR(p)]   (p_k = 1); 
8      F := false; 
9      U := k-n; 
10     GRAY(U, F); 
11   end for 
end 

procedure GRAY(U,F) 
begin 
     if U=0 then return else 
     begin 
1      flag := true; 
2      GRAY(U-1, flag); 
3      A(p^(U+n-1)) <- A(p); 
4      if F then 
5        MAR(p) := (MAR(p) + 2^(U-1)) mod M   (p_{U+n-1} = 0); 
6        MAR(p) := (MAR(p) - 2^(U-1)) mod M   (p_{U+n-1} = 1); 
7      else 
8        MAR(p) := (MAR(p) - 2^(U-1)) mod M   (p_{U+n-1} = 0); 
9        MAR(p) := (MAR(p) + 2^(U-1)) mod M   (p_{U+n-1} = 1); 
10     end if; 
11     C(p) := C(p) + A(p)*M[MAR(p)]   (p_k = 0); 
12     B(p) := B(p) + A(p)*M[MAR(p)]   (p_k = 1); 
13     flag := false; 
14     GRAY(U-1, flag); 
     end if; 
end; 
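
The interplay between the Gray code exchanges and the local address register can be checked with a small simulation of a single segment of M = 2^m PEs. This is a sketch under stated assumptions, not the authors' code: PE offsets within the segment are numbered 0..M-1, exchange bits are counted relative to the segment (bit k-n of the text becomes bit kk here), and the name `simulate_segment` is hypothetical.

```python
def simulate_segment(m):
    """Simulate PHASE-2 and GRAY for one column segment of M = 2**m PEs.

    A[o] holds the offset of the image element currently in PE o's register;
    MAR[o] is the window row that PE o would multiply it with.
    Returns a list of (A, MAR, outer_bit) snapshots, one per multiplication.
    """
    M = 1 << m
    A = list(range(M))
    MAR = [0] * M
    history = [(A[:], MAR[:], None)]  # initial multiply with MAR = 0

    def exchange(bit, increment_if_zero, outer):
        nonlocal A
        # every PE swaps registers with its neighbour across `bit`
        A = [A[o ^ (1 << bit)] for o in range(M)]
        for o in range(M):
            bit_is_zero = ((o >> bit) & 1) == 0
            sign = 1 if bit_is_zero == increment_if_zero else -1
            MAR[o] = (MAR[o] + sign * (1 << bit)) % M  # circular window column
        history.append((A[:], MAR[:], outer))

    def gray(U, F, outer):
        if U == 0:
            return
        gray(U - 1, True, outer)
        exchange(U - 1, F, outer)
        gray(U - 1, False, outer)

    for kk in range(m):  # kk plays the role of k-n
        exchange(kk, True, kk)
        gray(kk, False, kk)
    return history
```

In this simulation each PE at offset o always holds the element from offset (o + MAR) mod M, each PE meets every window row exactly once, and the element comes from a higher offset (the register C case) exactly when bit kk of o is 0.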


In the algorithm above, it takes M steps to broadcast 
window elements to local memory, and O(M²) multiplications 
to compute the C_ij. With the same initialization, these two 
terms are the same for any network, so we only consider the 
inter-PE communication time. 


In the loop t of CUBE-N2-1, PHASE-2 takes 
approximately M unit routes [1]. Rotating one column left in line 14 
takes log N unit routes. Outside the loop t, rotating one image 
segment up in line 16 requires log N - log M unit routes. The 
total inter-PE communication time is 
M(M + log N) + log N - log M, which is O(M² + M log N). Compared 
with the 2M² + 2M(log N - log M) + M log N inter-PE 
communication time given in [1], the time saving is about half, 
though it is still of the same order as that of algorithm 
CUBE-N2. 


A New Algorithm on Hypercube with N² PEs 


In this section, a new algorithm CUBE-N2-2 on a 
hypercube network with N² PEs is presented. This algorithm 
permutes image columns as well as image rows. 
The algorithm first computes the window column 
address and then computes the row address within a window 
column, both using Lemma 3. Since all the window columns 
must be held in local memory, M² local memory locations are needed. 
The algorithm CUBE-N2-2 is given below. A data movement 
example with N=8, M=4 is given in Fig. 2. 


The procedure CUBE-N2-2(A,B,C,D,E) uses five 
registers, A to E. It is assumed that A(i,j) is initialized to 
G(i,j). Registers C, B, D and E are used to accumulate the 
partial results of C_ij during execution and all of them are 
initialized to 0. Match_along_column in line 13 computes 
partial results of C_ij at each image element G(i,j) and stores 
them in registers C, B, D and E respectively. Lines 14-17 rotate 
the contents of D and E one segment left and add them to C 
and B respectively. Lines 18-19 rotate the content of B one segment up 
and add it to register C. Upon completion, the register C of 
each PE contains the final result of C_ij at image element 
G(i,j). 

In the procedure Match_along_column, variable t is the 
window column index, initialized to 0. The loop l of lines 3-12 
executes the permute-multiply-accumulate along the 
column direction. The procedure 
Match_within_column(A,C,B,t) has four parameters. 
Register A holds image data; C and B are used to accumulate the 
partial results. Variable t is the window column index and it 
is different for different PE columns. Variable s is the window 
row index, initialized to 0. Line 2 calculates the local address 
of the window element of each PE. The loop k executes 
permute-multiply-accumulate along the row direction. 


procedure CUBE-N2-2(A,B,C,D,E) 
begin 
1    C(p) := 0; 
2    B(p) := 0; 
3    D(p) := 0; 
4    E(p) := 0; 
5    MAR(p) := 0; 
6    for t := 0 to M-1 do 
7      for s := 0 to M-1 do 
8        CMAR := α + t*M + s; 
9        M[MAR(p)] <= CM[CMAR]; 
10       MAR(p) := MAR(p) + 1; 
11     end for; 
12   end for; 
13   Match_along_column; 
14   ROTATE(D, m, n-1, 0); 
15   C(p) := C(p) + D(p); 
16   ROTATE(E, m, n-1, 0); 
17   B(p) := B(p) + E(p); 
18   ROTATE(B, n+m, 2n-1, 0); 
19   C(p) := C(p) + B(p); 
end; 


procedure Match_along_column 
begin 
1    t := 0; 
2    Match_within_column(A, C, B, t); 
3    for l := 0 to m-1 do begin 
4      A(p^(l)) <- A(p); 
5      t := (t + 2^l) mod M   (p_l = 0); 
6      t := (t - 2^l) mod M   (p_l = 1); 
7      Match_within_column(A, C, B, t)   (p_l = 0); 
8      Match_within_column(A, D, E, t)   (p_l = 1); 
9      F := false; 
10     U := l; 
11     Gray_column(U, F); 
12   end for; 
end; 



procedure Match_within_column(A,C,B,t) 
begin 
1    s := 0; 
2    MAR(p) := t*M; 
3    C(p) := C(p) + A(p)*M[MAR(p)]; 
4    for k := n to n+m-1 do begin 
5      A(p^(k)) <- A(p); 
6      s := (s + 2^(k-n)) mod M   (p_k = 0); 
7      s := (s - 2^(k-n)) mod M   (p_k = 1); 
8      MAR(p) := t*M + s; 
9      C(p) := C(p) + A(p)*M[MAR(p)]   (p_k = 0); 
10     B(p) := B(p) + A(p)*M[MAR(p)]   (p_k = 1); 
11     F := false; 
12     U := k-n; 
13     Gray_row(U, F); 
14   end for; 
15   A(p^(n+m-1)) <- A(p); 
end; 


procedure Gray_column(U,F) 
begin 
1    if U=0 then return else 
2    begin 
3      flag := true; 
4      Gray_column(U-1, flag); 
5      A(p^(U-1)) <- A(p); 
6      if F then 
7        t := (t + 2^(U-1)) mod M   (p_{U-1} = 0); 
8        t := (t - 2^(U-1)) mod M   (p_{U-1} = 1); 
9      else 
10       t := (t - 2^(U-1)) mod M   (p_{U-1} = 0); 
11       t := (t + 2^(U-1)) mod M   (p_{U-1} = 1); 
12     end if; 
13     Match_within_column(A, C, B, t)   (p_l = 0); 
14     Match_within_column(A, D, E, t)   (p_l = 1); 
15     flag := false; 
16     Gray_column(U-1, flag); 
17   end if; 
end; 


procedure Gray_row(U,F) 
begin 
1    if U=0 then return else 
2    begin 
3      flag := true; 
4      Gray_row(U-1, flag); 
5      A(p^(U+n-1)) <- A(p); 
6      if F then 
7        s := (s + 2^(U-1)) mod M   (p_{U+n-1} = 0); 
8        s := (s - 2^(U-1)) mod M   (p_{U+n-1} = 1); 
9      else 
10       s := (s - 2^(U-1)) mod M   (p_{U+n-1} = 0); 
11       s := (s + 2^(U-1)) mod M   (p_{U+n-1} = 1); 
12     end if; 
13     MAR(p) := t*M + s; 
14     C(p) := C(p) + A(p)*M[MAR(p)]   (p_k = 0); 
15     B(p) := B(p) + A(p)*M[MAR(p)]   (p_k = 1); 
16     flag := false; 
17     Gray_row(U-1, flag); 
18   end if; 
end; 


In the algorithm CUBE-N2-2, there is no rotation in 
Match_along_column, nor in Match_within_column. The 
procedure Match_within_column takes M unit routes. In 
lines 7-8 of the loop l of Match_along_column, the procedure 
Match_within_column is called twice. Thus 
Match_along_column requires 2M² unit routes. The three 
ROTATEs in lines 14-18 of CUBE-N2-2 need 3(log N - log M) 
unit routes. Therefore the total inter-PE communication time 
is 2M² + 3(log N - log M), which is O(M² + log N). 
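
To see how the three unit-route counts quoted in the text compare for concrete sizes, they can be evaluated numerically. The sketch below simply codes the stated expressions; the function names are hypothetical, logarithms are base 2, and N and M are assumed to be powers of two as in the text.

```python
def log2_exact(x):
    """Base-2 logarithm of a power of two."""
    if x <= 0 or x & (x - 1):
        raise ValueError("expected a power of two")
    return x.bit_length() - 1


def routes_cube_n2(N, M):
    """2M^2 + 2M(log N - log M) + M log N, the count quoted from [1]."""
    n, m = log2_exact(N), log2_exact(M)
    return 2 * M * M + 2 * M * (n - m) + M * n


def routes_cube_n2_1(N, M):
    """M(M + log N) + log N - log M, for CUBE-N2-1."""
    n, m = log2_exact(N), log2_exact(M)
    return M * (M + n) + n - m


def routes_cube_n2_2(N, M):
    """2M^2 + 3(log N - log M), for CUBE-N2-2."""
    n, m = log2_exact(N), log2_exact(M)
    return 2 * M * M + 3 * (n - m)
```

For N=32 and M=8 these expressions give 200, 106 and 134 unit routes respectively; the advantage of CUBE-N2-2 lies in the log N term no longer being multiplied by M, which matters as N grows relative to M.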


Conclusion 


Two template matching algorithms on a hypercube SIMD 
computer with N² PEs are presented. The first algorithm 
improves the local address computation scheme of the 
algorithm given by Fang [1]. The inter-PE communication time is reduced 
to half. The second algorithm extends the 
permute-multiply-accumulate operations in Gray code sequence to both the column 
and row directions. The inter-PE communication time is 
reduced to O(M² + log N). The template matching computation 
is intrinsically a shift(rotate)-multiplication, while 
the hypercube network can only perform exchanges. Therefore 
a rotation must be implemented by a sequence of exchanges, 
which takes O(log N) steps. In the second algorithm, 
CUBE-N2-2, we have the minimum rotation time, thus the 
O(M² + log N) inter-PE communication time is optimal in this 
sense. The trade-off is that the local memory of each PE is 
increased to M². 
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Fig. 2. One step of a data movement example of the algorithm CUBE-N2-2 for N=8 and M =4 is 
given here. In the table above, i, j is the PE index, s, t is the window index, and A, B, C, D are registers to 
accumulate the result. 


The table shows the combined result of the loop k of lines 4-14 for k=n=3 in 
Match_within_column for both p_l=0 and p_l=1 of lines 13-14 in Gray_column of line 11 in 
Match_along_column. 
Abstract -- In this paper we provide an 
adaptation of the algorithms for neighbor finding 
(NFA) and boundary following (BFA) on the 
hypercube architecture. We encode quadtrees on a 
linear array and analyse two schemes for embedding 
them on the hypercube. We observe that the 
communication delay in NFA using block embedded 
linear quadtree is low. Finally, we provide a 
parallel adaptation of the BFA on the hypercube 
based on block embedding and derive an expression 
for its speedup. 
1. Introduction 

Region representation of images plays an 
important role in image processing, VLSI design 
and computer graphics[2,3,5-9]. Most commonly used 
algorithms deal with finding the boundary of a 
region, finding the neighbors of nodes in the 
tree, and performing operations on these 
trees[2,3,5-9]. With the advent of multiprocessor 
systems, the need for development of parallel and 
efficient algorithms is being increasingly 
realised. In this paper we attempt to provide 
parallel adaptations for the existing 
algorithms[2,3] on the hypercube architecture[4]. 


2. ENCODING QUADTREES IN A LINEAR ARRAY[6-9] 


A node in a quadtree is a number called 
K_value where K_value = k_n k_{n-1} ... k_1 and 0 <= k_i < 5. 
Each digit k_i denotes the path to be taken at 
level i (0=terminal, 1=NW, 2=NE, 3=SW, 4=SE) to 
reach the node from the root, which is assigned level 
0. For an image and its corresponding quadtree 
shown in Fig. 2.1, node 10 has a K_value 0441. The 
K_value gives a unique path from the root to any node 
in the tree and the most significant non-zero 
entry in K_value specifies its sontype. 

A K_value yields a unique decimal number 

$$k_n \cdot 4^{n-1} + k_{n-1} \cdot 4^{n-2} + \dots + 4 k_2 + k_1$$

which may be used to index the linear array representation with 
dimension [0..(4^n - 1)*4/3] for a quadtree of 
depth n. Each element of this array contains 2 
bits of information used to represent the 
colortype of the node as follows. 
type colortype=(white, gray, not_used, black); 

A white or black node is a terminal node and 
has no children. However, with regard to the array 
representation of quadtrees, the nodes allocated 
for their children are labelled "not_used". 
In the rest of the paper, we adopt the definitions 
and algorithms given in [2,3]. 
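
The array index of a node can be computed from its digits with Horner's rule. The sketch below assumes the index is k_n*4^(n-1) + ... + 4*k_2 + k_1 with the digits written most significant first, as in the example K_value 0441; the formula is a reading of a damaged passage and the function name `k_value_index` is hypothetical.

```python
def k_value_index(digits):
    """Array index for a K_value given as digits (k_n, ..., k_1), each in 0..4.

    Assumed formula: k_n*4**(n-1) + ... + 4*k_2 + k_1.
    """
    index = 0
    for k in digits:  # most significant digit first
        if not 0 <= k <= 4:
            raise ValueError("each digit must be in 0..4")
        index = index * 4 + k
    return index
```

Under this assumption the K_value 0441 maps to index 81, and the largest depth-4 value, 4444, maps to 340 = (4^4 - 1)*4/3, the upper bound of the array.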


3. ALGORITHMS USING LINEAR QUADTREES [7-10] 


The BFA and the related NFAs have been 
reproduced here in pseudo Pascal. 

Algorithm NFA 3.1 EQUAL_ADJ_NEIGHBOR 
function EQUAL_ADJ_NEIGHBOR (K:K_value; 
         var K_ret:K_value; d:side):boolean; 
{ finds an equal size node K_ret which is 
  adjacent along the d side of node K. Function 
  returns FALSE if such a node doesn't exist } 
begin 
  if K=root_node then EQUAL_ADJ_NEIGHBOR:=FALSE 
  else begin 
    i:=max_level; 
    while K[i]=0 do 
    begin 
      K_ret[i]:=0;i:=i-1 
    end; 
    repeat K_ret[i]:=mirror(K[i],d);i:=i-1 
    until (adjacent(K[i+1],d)=FALSE) or (i<1); 
    if adjacent(K[i+1],d)=TRUE then 
      EQUAL_ADJ_NEIGHBOR:=FALSE 
    else begin 
      for i:=i downto 1 do K_ret[i]:=K[i]; 
      EQUAL_ADJ_NEIGHBOR:=TRUE 
    end 
  end 
end; 
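
A runnable sketch of the equal-size neighbor search may help in following the pseudo Pascal. The helpers `mirror` and `adjacent` are not defined in this paper (they come from [2,3]); the definitions below are assumptions based on the usual quadtree convention: `adjacent(q, d)` is true when quadrant q touches side d of its parent, and `mirror(q, d)` reflects q across the axis of side d. K is stored as a tuple (k_1, ..., k_n), so the printed K_value 0441 is (1, 4, 4, 0).

```python
NW, NE, SW, SE = 1, 2, 3, 4
# Assumed convention: quadrants touching each side of the parent block.
TOUCHES = {'N': {NW, NE}, 'S': {SW, SE}, 'W': {NW, SW}, 'E': {NE, SE}}


def adjacent(q, d):
    return q in TOUCHES[d]


def mirror(q, d):
    if d in ('N', 'S'):  # reflect top <-> bottom
        return {NW: SW, SW: NW, NE: SE, SE: NE}[q]
    return {NW: NE, NE: NW, SW: SE, SE: SW}[q]  # reflect left <-> right


def equal_adj_neighbor(K, d):
    """Equal-size neighbor of node K on side d, or None if it lies off the image."""
    if not any(K):
        return None  # root node has no neighbor
    ret = list(K)  # trailing zero digits are kept as they are
    i = len(K) - 1
    while K[i] == 0:
        i -= 1
    while True:
        ret[i] = mirror(K[i], d)
        crossed = adjacent(K[i], d)  # neighbor lies outside this parent
        i -= 1
        if not crossed:
            return tuple(ret)  # remaining digits are unchanged
        if i < 0:
            return None  # ran past the root: no neighbor
```

For example, the east neighbor of the node reached by NW then NE is the node reached by NE then NW.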


Algorithm NFA 3.2 CORNER_ADJ_NEIGHBOR 
function CORNER_ADJ_NEIGHBOR (K:K_value; 
         var K_ret:K_value; d:side; c:quadrant):boolean; 
{ finds a non-gray neighbor K_ret which is adjacent 
  along side d of K and is aligned to the other side 
  d' such that c=quadrant(d,d'). Function returns 
  FALSE if such a node doesn't exist } 
begin 
  if K=root_node then CORNER_ADJ_NEIGHBOR:=FALSE 
  else begin 
    i:=max_level; 
    while K[i]=0 do 
    begin 
      K_ret[i]:=0;i:=i-1 
    end; 
    repeat K_ret[i]:=0;i:=i-1 
    until (adjacent(K[i+1],d)=FALSE) or (i<1); 
    if (adjacent(K[i+1],d)=FALSE) then 
    begin 
      for j:=1 to i do K_ret[j]:=K[j]; 
      i:=i+1; 
      while gray(K_ret) do 
      begin 
        if level(K)>level(K_ret) then 
          K_ret[i]:=mirror(K[i],d) 
        else K_ret[i]:=mirror(c,d); 
        i:=i+1 
      end; 
      CORNER_ADJ_NEIGHBOR:=TRUE 
    end 
    else CORNER_ADJ_NEIGHBOR:=FALSE 
  end 
end; 


Algorithm NFA 3.3 ALIGNED 
function ALIGNED (K1,K2:K_value; d:side): boolean; 
{ Given two nodes K1 and K2 such that K2 is 
  adjacent along counterclockwise side of d of K1, 
  function returns TRUE if K1 and K2 are aligned 
  along d side of K1 else a FALSE. } 
begin 
  if (K1=root_node) or (K2=root_node) then 
    ALIGNED:=FALSE 
  else begin 
    i:=1; 
    while (K1[i]<>0) or (K2[i]<>0) do i:=i-1; 
    if level(K1)=level(K2) then ALIGNED:=TRUE 
    else if level(K1)<level(K2) then 
      interchange(K1,K2); 
    repeat 
      if (K2[i]=0) and adjacent(K1[i],d) then 
        i:=i-1 
    until (i<1) or (K2[i]<>0) or 
          not adjacent(K1[i],d); 
    if i<1 then ALIGNED:=FALSE 
    else if K2[i]=0 then ALIGNED:=FALSE 
    else ALIGNED:=TRUE 
  end 
end; 


Algorithm BFA 3.4 BFA 
procedure BFA(K1,K2:K_value;d:direction); 
{ Given two nodes K1 (black) and K2 (white) which 
  are adjacent along direction d of K1, the 
  algorithm traces the boundary of the region. The 
  algorithm outputs a sequence of strings giving the 
  direction and the length of the path to be 
  traversed along that direction } 

The BFA algorithm is not reproduced here due 
to space limitation and can be found in [10]. 


4. EMBEDDING LINEAR QUADTREES ON THE HYPERCUBE 


The hypercube is a MIMD machine having a 
d-dimensional cube interconnection topology with one 
processor module at each node of the cube. Each 
edge in the cube is constituted by a physical 
communication link[4]. A hypercube of dimension 
four is considered for embedding of quadtrees. The 
address of a PE is denoted by A = (a_4 a_3 a_2 a_1) and 
the K_value of a node in the quadtree is denoted 
by k_n k_{n-1} ... k_1 where 0 <= k_i < 5. The quadtree can 
be embedded in the hypercube in two ways, viz. 
Block Embedding and Complete Embedding. Details of 
the embedding are given below. 


4.1 Block Embedding 


In this case, sub-trees of the quadtree are 
assigned to different PEs in the hypercube as 
shown in Fig. 4.1. The address of the PE in which a 
node in the quadtree would lie is given by the k_1 
and k_2 digits of the node's K_value. 


4.2 Complete Embedding 


All the quadtree nodes are distributed to the 
PEs in the hypercube. The hashing procedure to 
find the PE address (pe_addr) of a quadtree node 
is as follows: 
initial pe_addr := (0000)_2; 
for i := 1 to max_level do 
  if k_i <> 0 then complement (k_i)th bit of pe_addr; 
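
Both embeddings can be illustrated with a few lines of Python. This is a sketch with labeled assumptions: bits of pe_addr are numbered from 1 at the least significant end, the node is given as the tuple (k_1, ..., k_n), and for block embedding the two digits are packed as (k_1 - 1)*4 + (k_2 - 1), which is one plausible packing rather than the one shown in Fig. 4.1. The function names are hypothetical.

```python
def complete_embedding_pe(K):
    """Hash a node (k_1, ..., k_n) to a 4-bit PE address.

    For every non-zero digit k_i, the k_i-th bit of the address is complemented.
    Assumption: bit 1 is the least significant bit.
    """
    pe_addr = 0b0000
    for k in K:
        if k != 0:
            pe_addr ^= 1 << (k - 1)
    return pe_addr


def block_embedding_pe(K):
    """PE address from the first two path digits k_1 and k_2 (assumed packing)."""
    k1, k2 = K[0], K[1]
    if k1 == 0 or k2 == 0:
        raise ValueError("node lies above the block level")
    return (k1 - 1) * 4 + (k2 - 1)
```

With block embedding all descendants of a level-2 node share a PE, whereas the hashing scheme scatters neighboring nodes over different PEs.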


5. COMPLEXITY ANALYSIS OF NFA 


Our adaptations of NFAs involve manipulation 
of K_values. The K_values of interest are the initial 
and final ones, and execution time depends on the 
communication delay between the PEs in which the 
nodes reside. We assume a random image comprising 
many uniformly distributed connected regions such 
that a node is equally likely to appear in any 
position and level in a quadtree and the 
perimeters of regions have low standard deviation. 

Consider a block embedded quadtree. With 
regard to the NFA Equal_adj_neighbor, the possible 
communication delay is between the neighbors which 
fall on the boundary such that two adjacent 
neighbors are on two different processors. 

For 16 processor allocation, there will be 6 
boundaries of total 2^n leafnodes each (see Fig. 
4.1a). The adjacent neighbor pairs of size one 
leafnode falling on the boundary will be 6 * 2^n. 
Similarly, pairs of size 4^i falling on the 
boundary will be 6 * 2^(n-i). Total number of 
neighbor pairs falling on the boundary will 
therefore be 

$$6 \times (2^n + 2^{n-1} + \dots + 1) = 6 \times (2^{n+1} - 1).$$

However, total number of neighbor pairs in 
the image [3] is 

$$\sum_{i=0}^{n-1} 2^{n-i} \times (2^{n-i} - 1).$$

Communication requirement (C) on the average 
will therefore be 

$$C = \frac{6 \times (2^{n+1} - 1)}{\sum_{i=0}^{n-1} 2^{n-i} \times (2^{n-i} - 1)}$$

or, C = 9 * 2^(-n) for n very large. 
Assuming an average communication length of 
two hops between processors, the total 
communication delay can be expressed as 

(C*2)/16 = (9/8) * 2^(-n) for large n. 

Similarly, it can be shown that all the other 
NFAs will also have the same communication delay, 
for large n. 
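
The limiting value of C can be checked numerically. The sketch below assumes the boundary count 6 * (2^(n+1) - 1) and the total pair count given by the sum of 2^(n-i) * (2^(n-i) - 1) for i = 0..n-1; the function names are hypothetical.

```python
def boundary_pairs(n):
    """Neighbor pairs lying on the 6 processor boundaries: 6 * (2^(n+1) - 1)."""
    return 6 * (2 ** (n + 1) - 1)


def all_pairs(n):
    """Total neighbor pairs: sum over i of 2^(n-i) * (2^(n-i) - 1)."""
    return sum(2 ** (n - i) * (2 ** (n - i) - 1) for i in range(n))


def comm_requirement(n):
    """Average communication requirement C for a depth-n quadtree."""
    return boundary_pairs(n) / all_pairs(n)
```

Multiplying `comm_requirement(n)` by 2^n approaches 9 as n grows, in agreement with C = 9 * 2^(-n).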

For a completely embedded quadtree, however, 
it is difficult to derive similar expressions for 
the average number of hypercube nodes visited. 
Programs were run for different NFAs and the 
results are presented in Table 5.1. 


TABLE 5.1 


Algorithm                    Average number of hops 
Equal_adjacent_neighbor 
GT_Equal_adj_neighbor 
Equal_corner_neighbor 
Corner_Adj_neighbor 
Aligned 


6. BFA ON THE HYPERCUBE 


Prior to the description of BFA on the 
hypercube, we lay down the following assumptions. 

1. An image comprises R regions which are 
uniformly distributed over the entire area A of 
the image. The area A is defined as the number of 
pixels in the image, 2^n x 2^n for an n-level tree. 
2. The density of the image D is specified as the 
ratio of black area to total area. Total area 
occupied by all regions will, therefore, be (D*A). 
3. The number of regions R in the image is very 
large (R >> number of PEs). 
4. The image has been block embedded. 

procedure Parallel_BFA; 
var start1,start2:array [1..R] of K_value; 
cobegin 
  for i:=1 to R do {for all regions in image} 
  begin 
    locate start1[i] and start2[i]; 
    BFA(start1[i],start2[i],N) 
  end 
coend; 


6.1 Speedup 


For this algorithm, the time required to 
trace one boundary of perimeter p is O(p). The 
computation time on the parallel machine will 
therefore be O(P/16), P being the total perimeter. 
The maximum communication delay will occur only 
when the boundary of the region coincides with the 
processor boundary, and all black and white node 
pairs are at leafnode level. In such a case, there 
can be a maximum of 6 * 2^n communications. If the 
communication time is t_c and execution time is t_e 
then total time required on the parallel system 
will be (P/16) * t_e + (6 * 2^n) * t_c 
and the speedup will be 

$$\frac{P \, t_e}{(P/16) \, t_e + (6 \times 2^n) \, t_c}.$$

Assuming square regions, a maximum of R 
regions, and average area per region given by 
(D*A/R), the total perimeter P of all the regions 
in the image can be expressed as 4 * (D*A*R)^(1/2). 
Further, A = 2^n * 2^n, and with regard to VLSI 
layouts[1], R is proportional to A, or R = γ*A. Hence speedup is 

$$\frac{16}{1 + 24 \times 2^{-n} \times (D \gamma)^{-1/2} \times (t_c / t_e)} \approx 16 \quad \text{(for large } n\text{)}.$$
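
The closed form can be checked against the direct ratio of serial to parallel time. The sketch below assumes the reading used here: total perimeter P = 4 * (D*A*R)^(1/2), A = 4^n, R = γ*A, communication time t_c and execution time t_e; the symbol names and the function names are assumptions, since the original symbols are partly illegible.

```python
import math


def speedup_direct(n, D, gamma, t_c, t_e, pes=16):
    """Serial time P*t_e divided by parallel time (P/pes)*t_e + 6*2^n*t_c."""
    A = 4 ** n                       # image area in pixels
    R = gamma * A                    # number of regions, proportional to area
    P = 4 * math.sqrt(D * A * R)     # total perimeter of R square regions
    return (P * t_e) / ((P / pes) * t_e + 6 * 2 ** n * t_c)


def speedup_closed(n, D, gamma, t_c, t_e):
    """16 / (1 + 24 * 2^(-n) * (D*gamma)^(-1/2) * t_c/t_e)."""
    return 16 / (1 + 24 * 2.0 ** (-n) * (D * gamma) ** -0.5 * (t_c / t_e))
```

Both functions agree for 16 PEs, and the result tends to 16 as n grows because the communication term shrinks like 2^(-n).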


7. CONCLUSIONS 


From the simplicity and efficiency of 
implementation, it can be concluded that our 
approach to parallel processing can be readily 
extended to CAD tools, viz. design rule 
checker[5], circuit extractor, routers etc. that 
are designed based on NFAs and BFA. 
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(a) Image (16 X 16) 
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(b) Quadtree for the Image 


FIG. 2.1 An Image and its corresponding Quadtree 
(a) 16 Blocks of an Image and their PE assignments (processor boundaries marked) 
(b) Quadtree corresponding to the Image in (a) 
FIG. 4.1 An Image and its partitions 
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OPTIMISING A RECONFIGURABLE MIMD TRANSPUTER MACHINE FOR LINE-OF-SIGHT 
CALCULATIONS ON LARGE DIGITAL MAPS 
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Abstract 

A major demand for computing power 
in simulators and in systems for 
evaluating and optimally siting sensors is 
tracing lines of sight to determine 
intervisibilities and calculating 'depths 
of shadow' between arbitrary points above 
3-D landscapes represented by large, high 
resolution digital maps. 

We show how a MIMD parallel 
machine with a switched interprocessor 
connection topology can be configured and 
programmed for these problems; the machine 
topology and allocation of tasks being 
rearranged to achieve spectacular 
performances on maps of differing 
resolution and for various graphical 
display requirements. The machine 
architecture is a modular network of 
Transputers capable of making arbitrary 
interconnections between as many as 1000 
processors. 


1. Philosophy of Machine Architecture 


Compact computers of very high 
potential performance can now be assembled 
rather easily from the present generation 
of single chip processing elements (PE). 
Realising a high proportion of the 
potential performance may be difficult 
however, particularly with irregular or 
data dependent algorithms. Even 
systematic problems need careful 
allocation of the tasks between processors 
to avoid serious under utilisation of the 
machine, and the communication strategy 
between processors assumes a vital role in 
preventing wasteful bottlenecks. However 
VLSI allows us to implement complex 
interconnection networks between 
processors to alleviate this problem. 

With several partners in an ESPRIT 
consortium [1], we are developing 
multi-transputer [2] processor machines 
based on the Communicating Sequential 
Processes (CSP) [3]/OCCAM model of 
computing [4]. A machine built on this 
model is characterised by distributed 
memory and point to point communication 
links. This model allows great hardware 
and software flexibility and versatility. 
For example it allows MIMD processing, 
modular construction of machines, fixed or 
floating point PEs and individual 
processor memory sizes, all augmented by 
the extra freedom to reconfigure the 
machine, dynamically if desired, by 
software control of the switches. 


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This architecture, known as RTP 
(Reconfigurable Transputer Processor) 
differs crucially from machines based on 
shared memory or shared communications 
buses, both of which entail heavy hardware 
costs in achieving high memory and 
communication bandwidths [5]. Like 
‘hypercube’ machines, it is a coarse grain 
MIMD machine using 16 or 32-bit transputer 
PEs which execute distinct programs. 
However the RTP topology consists of 
one-to-one channels set up by switches 
rather than a fixed low-diameter network 
through which data messages are routed. 

Any connection topology possible 
with the four bidirectional links 
available on each transputer can be 
realised for as many as 1000 PEs, 
including the use of multiple links 
between transputers. The RTP philosophy 
is to employ hardware techniques to 
match the machine to the problem. The 
network is set up automatically from the 
‘harness’ (known as the wiring diagram) 
which is always written to describe the 
allocation of processes to processors. In 
cases where a problem involves several 
phases such as I/O, data sorting, 
scatter/gather operations etc., the 
configuration may be optimised for each and 
set accordingly. There is no hardware 
obstacle to altering the switch 
connections dynamically, but operating 
system tools are not yet available to 
decide routings and to prevent the 
interruption of current transfers. 


2. Hardware 


Although the RTP is still in 
development, several prototypes have been 
built and programmed in order to verify 
ideas. The standard hardware module for 
RTP machines is a ‘SuperNode’ built around 
16 ‘worker’ PEs which are normally 
floating point transputers (T800s). This 
transputer was developed as part of the 
ESPRIT project and is capable of 
calculation at a sustained rate of between 
1 and 2 MFlop/s. The worker transputers 
in a supernode are connected to each 
other, to a memory server, to a hard disc 
unit and to other supernodes through two 
72 x 72 crossbar link switches, controlled 
by a ‘control’ transputer. These all 
communicate via standard Inmos links, but 
the control transputers have an 
independent low bandwidth bus connected to 
all the transputers in the machine to 
allow diagnostic information to be 
extracted without disturbing the state of 
the RTP links. The 72 x 72 crossbar 
switch has been implemented using two 
15000-gate CMOS gate arrays. 

In the prototype RTP machines each 
worker PE has 256 KBytes of static RAM in 
addition to the T800’s 4 KBytes of 
internal memory. In later versions this 
will be optionally extended to 4 MBytes of 
dynamic RAM each. 

Larger machines are built up by 
interconnecting ‘supernodes’ using a 
further layer of switches. 
Sufficient links are available for the 
internode switches to allow the switching 
network to make any possible transputer 
graph in a ‘rearrangeable’ manner [6]. 


3. Terrain Visibility and Shadowing 


A demanding application we have 
programmed on the RTP is the calculation 
of lines of sight using digital maps. The 
programs calculate the ground that is 
visible to an observer at a given height 
and position over the map and also the 
depth of concealment (or ‘shadow’ if we 
think of the observer as a light source). 

Computation on digital maps 
presents an interesting combination of 
benchmark tests. These computations are 
numerically intensive and can be done with 
either integer or floating point 
arithmetic. The maps involved can be 
large and the distribution of this 
database amongst many processors to allow 
multiple access to the database is in 
itself an interesting problem. 
Furthermore these applications are 
considerably enhanced by good graphical 
displays. Manipulating images to provide 
rapid response times is a further 
computational load and exercises the 
machine’s I/O capability. 

A major factor in the performance 
of the visible area calculations is the 
number of rays traced. Computing N² rays 
for an N x N map is highly redundant and a 
reduction by a factor of order N is 
possible. A profile of the ground along 
each ray is formed, together with the look 
elevation angle to each point on it. The 
program then decides if each point along 
the profile is visible or not and 
colours the display accordingly. 
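
One way to obtain a reduction of order N, sketched here as an assumption rather than as the authors' stated method, is to trace rays only to the cells on the edge of the map, since every interior cell lies close to one of those rays. The function name `boundary_targets` is hypothetical and the map is assumed to be at least 2 x 2.

```python
def boundary_targets(n):
    """Return each cell on the edge of an n x n map exactly once (n >= 2)."""
    cells = []
    for k in range(n):
        cells.append((k, 0))          # top edge, including both corners
        cells.append((k, n - 1))      # bottom edge, including both corners
    for k in range(1, n - 1):
        cells.append((0, k))          # left edge, corners already listed
        cells.append((n - 1, k))      # right edge, corners already listed
    return cells


n = 256
rays_all_cells = n * n                      # one ray per map cell
rays_boundary = len(boundary_targets(n))    # one ray per edge cell: 4n - 4
print(rays_all_cells, rays_boundary, rays_all_cells / rays_boundary)
```

For a 256 x 256 map this replaces 65536 rays by 1020, a saving of roughly N/4.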

Two versions of the programs 
developed are described in sections 3.1 
and 3.2 below. 


3.1 Small Map 


The algorithm was first implemented 
for a 256 x 256 byte map of the Isle of 
Wight, but allowance was made for the 
future need to use much larger maps which, 
because of their size, could not be stored 
in the memory of a single processor. With 
this in mind a pipelined solution with 
different processors doing different tasks 
was adopted. Four stages of the algorithm 
were identified: 


i) Calculate the (x,y) coordinates 
along each ray using Bresenham’s 
algorithm [7]. The change in x and/or y 
per step is at most 1. 


ii) Using the (x,y) coordinates of a 
ray fill in the heights of the terrain 
along the ray creating a profile. 


iii) Starting from the observer end of 
the ray decide if each point along the ray 
can be seen. This is done by comparing 
the tangent of the elevation angle from 
the observer to the terrain height at that 
point with the maximum tangent calculated 
so far. If the tangent at that point is 
less than the maximum so far, the point is 
not visible; if greater, it is visible. 


iv) From the result of stage (iii) 
decide what colour to display on the 
graphics screen and where, for each point 
along the ray. The colour convention 
adopted had the underlying terrain heights 
displayed as a red/green (or brown) colour 
bar and areas not visible coloured black 
or as a grey scale in the depth of shadow 
computations. 
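
The four stages can be sketched serially in Python as follows. This is an illustration only: the authors' programs were written in Occam and spread over a transputer pipeline, and the function names, the `heights[y][x]` layout, the `cell_size` parameter and the colour labels are all assumptions made for the sketch. Points whose tangent equals the running maximum are treated as visible, a case the text leaves open.

```python
import math


def bresenham(x0, y0, x1, y1):
    """Stage (i): integer cells on the ray; x and y change by at most 1 per step."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 >= x0 else -1
    sy = 1 if y1 >= y0 else -1
    err = dx - dy
    x, y = x0, y0
    points = [(x, y)]
    while (x, y) != (x1, y1):
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
        points.append((x, y))
    return points


def profile_along(heights, points):
    """Stage (ii): terrain height at each cell of the ray."""
    return [heights[y][x] for x, y in points]


def visible_flags(points, profile, observer_height, cell_size=1.0):
    """Stage (iii): compare each tangent with the maximum tangent so far."""
    x0, y0 = points[0]
    eye = profile[0] + observer_height
    max_tan = float("-inf")
    flags = [True]                          # the observer's own cell
    for (x, y), h in zip(points[1:], profile[1:]):
        dist = math.hypot(x - x0, y - y0) * cell_size
        tan = (h - eye) / dist
        if tan >= max_tan:                  # ties count as visible (assumption)
            max_tan = tan
            flags.append(True)
        else:
            flags.append(False)
    return flags


def colours(profile, flags):
    """Stage (iv): height-coded colour if visible, black otherwise."""
    return [("height", h) if seen else ("black", None)
            for h, seen in zip(profile, flags)]


def trace_ray(heights, observer, target, observer_height, cell_size=1.0):
    """Run stages (i) to (iv) for one ray and pair each cell with its colour."""
    points = bresenham(*observer, *target)
    profile = profile_along(heights, points)
    flags = visible_flags(points, profile, observer_height, cell_size)
    return list(zip(points, colours(profile, flags)))


# A single-row map with a 10 unit ridge at x = 2 hides the cells behind it.
ridge = [[0, 0, 10, 0, 0]]
print(trace_ray(ridge, (0, 0), (4, 0), observer_height=2.0))
```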


For the small map implementation 
(256 x 256 bytes) the first two stages 
above were implemented in one processor. 
The network of processors used is shown in 
figure 1. There are 4 parallel pipelines 
which work on 8 adjacent rays; the rays 
are chosen to be adjacent to ensure good 
load balancing. The first processor in 
each pipeline performs stages (i) and (ii) 
above; each pipeline then splits into two, 
each sub-pipeline performing the 
shadowing computation (iii). The two 
sub-pipelines in each pipeline then come 
together in the fourth processor, which 
does the colour computation (iv). Several 
other networks were used during program 
development but this one was chosen as it 
gave the most balanced load amongst 
the processors. The load on each of the 
processors was measured by the use of an 
‘efficiency monitor’ which shows how busy 
each of the processors in the network was. 


3.1.1 Performance 


Table 1 gives the execution times 
for a two metre high observer both at the 
centre of the map and in one corner. 

The depth of shadow calculation 
shown in the table is a calculation of how 
high a target would have to be to be seen 
at that point on the map if the ground is 
not directly visible from the observer. The 
shadow is displayed as a graduated grey 
scale that saturates black at a user 
definable depth. 
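
The paper does not give a formula for the depth of shadow. A minimal sketch, assuming it is the vertical gap between the ground and the sight line that grazes the highest obstruction seen so far, is shown below. The names `depth_of_shadow` and `grey_level` and the linear grey ramp are assumptions.

```python
def depth_of_shadow(distances, heights, eye_height):
    """Height a target must reach above the ground to be seen at each point.

    distances: horizontal distance of each point from the observer (> 0)
    heights:   terrain height at each point, in the same units
    eye_height: absolute height of the observer's eye
    """
    max_tan = float("-inf")
    depths = []
    for d, h in zip(distances, heights):
        tan = (h - eye_height) / d
        if tan >= max_tan:
            max_tan = tan
            depths.append(0.0)              # ground itself is visible
        else:
            # height of the grazing sight line at distance d, minus the ground
            depths.append(eye_height + max_tan * d - h)
    return depths


def grey_level(depth, saturation_depth):
    """Map depth to 0.0 (no shadow) .. 1.0 (black), saturating at the limit."""
    return min(depth / saturation_depth, 1.0)


depths = depth_of_shadow([1, 2, 3, 4], [0, 10, 0, 0], eye_height=2)
print(depths)                               # [0.0, 0.0, 14.0, 18.0]
print([grey_level(d, 16.0) for d in depths])
```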

In its final form the program ran 
in the times shown in table 1 on 16 T414 
17 MHz ‘worker’ transputers, a T414-15 
B007 graphics board with 0.5 MByte Video 
RAM and 0.5 MByte DRAM and a B004. The 
B004 is a transputer card with 2 MBytes of 
DRAM that acts as a host in an IBM-AT. 

The use of an efficiency monitor 
showed that for many cases the rate of 
execution of the program was limited by 
the speed at which the graphics processor 
could display the results. In these cases 
there is nothing to be gained from using 
more ‘worker’ processors. Cases where the 
graphics processor load dominates are 
characterised by similar times for integer 
and floating point calculations. 

With sub-second calculation times 
the use of a tracker-ball or mouse to move 
the position of the observer gives 
impressive interactive displays. 


3.2 Larger Maps 


Comparison of the 256 x 256 byte 
map of the Isle of Wight with an Ordnance 
Survey (paper) map revealed several 
inaccuracies and a more detailed digital 
map was obtained. This map was 1024 x 512 
points of 16-bits, each spaced on a 50m 
grid as opposed to the 100m grid for the 
earlier map. 

As the B007 graphics board drives 
a 512 x 512 screen, it was decided to 
limit the shadowing calculation to a 512 x 
512 section of the map at any one time, 
the rest of the map being held in RAM on 
the B004. With this limitation the 
storage required by the used area of the 
map was 0.5 MBytes. As each of the 
processors in the network has 256 KBytes 
of RAM, at least three processors have to 
be dedicated to the storage of each copy 
of the map. Splitting the map up amongst 
processors then allows parallel access to 
parts of the map, speeding up the height 
lookup process. 

The final network of processors 
used is shown in figure 2. This has been 
optimised to approximately equalise the 
load across the processors for shadowing 
calculations using integer arithmetic on 
T414s. The loading on the processors was 
measured using the efficiency monitor. 
Figure 2 should be viewed as two pipelines 
each of which splits into two. The four 
sub-pipes process adjacent rays to balance 
the load between pipelines. Various 
configurations were readily explored 
because the machine configuration is 
automatically set up from the normal Occam 
description of the program. 

The main differences between the 
256 x 256 byte map and 512 x 512 16-bit 
map are discussed in the following 
sections. 


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3.2.1 Map Look Up 


As noted above, the map storage is 
split amongst three processors. The 
x.y.list process calculates where each 
ray enters or leaves each of the three map 
sections and passes these as parameters to 
the three look-up processors. This was 
found to be the most efficient way of 
organising the lookup. 
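
A minimal sketch of this entry/exit bookkeeping is given below. The paper does not say how the map is divided, so the split into three horizontal bands of rows, the 512-row default and the names `section_of` and `split_ray` are assumptions made for illustration.

```python
def section_of(y, rows=512, n_sections=3):
    """Index of the horizontal band (look-up processor) holding map row y."""
    band = -(-rows // n_sections)           # ceiling division: rows per band
    return y // band


def split_ray(points, rows=512, n_sections=3):
    """Group the points of a ray into runs lying inside one map section.

    Returns (section, first_index, last_index) tuples, i.e. where the ray
    enters and leaves each section, in the order the ray visits them.
    """
    runs = []
    for idx, (_, y) in enumerate(points):
        sec = section_of(y, rows, n_sections)
        if runs and runs[-1][0] == sec:
            runs[-1][2] = idx               # still inside the same section
        else:
            runs.append([sec, idx, idx])    # the ray has entered a new section
    return [tuple(r) for r in runs]


ray = [(0, y) for y in range(0, 512, 100)]  # a vertical ray sampled every 100 rows
print(split_ray(ray))                       # [(0, 0, 1), (1, 2, 3), (2, 4, 5)]
```

Each tuple could then be sent to the corresponding look-up processor, which returns the heights for its part of the profile.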


3.2.2 Performance 


Table 2 summarises the performance 
achieved on the larger map with the same 
processors described in section 3.1.1 
above. Performance could be improved by 
using faster processors (20 or 30MHz 
parts) and faster links with overlapped 
acknowledge. The processors used for the 
timings below used 10 Mbit/s links. 


3.2.3 Displays 


The two photographs illustrate 
some of the points mentioned above; both 
photographs show part of the larger map. 

Photograph 1 shows the area 
visible to a 2m high observer on St 
Catherines Down at the cursor position. 
The version of the program used to 
calculate this required 5.89 sec and used 
integer arithmetic throughout. In the 
bottom left hand corner of all these 
photographs is a display of the 
utilisations of all the processors in the 
network during the calculation. The level 
of the white line indicates 100% 
utilisation. The first 8 columns refer to 
the top pipeline in figure 2, the next 8 
the lower pipeline and the rightmost 
column is the B007 graphics processor. 

The order of display within the pipelines 
is: 


(x.y.list, lookup 0, look up 1, look up 2, 
top shadow, top colour, lower shadow and 
lower colour.) 


Notice that no processor is 100% utilised. 
This shows that either the processors are 
waiting for inputs whilst doing no useful 
work or that the link bandwidth is 
saturated. The network was configured to 
maximise processor usage on visible area 
integer calculations. 

The second photograph is a display 
of the depth of shadow for an observer at 
the same point. The depth of shadow is 
displayed as a grey scale which saturates 
black when the depth of shadow exceeds 
100m. This calculation has been done 
using floating point arithmetic and took 
17.35 sec. Notice here that the 4 
processors performing the shadowing 
computations are 100% used and limit the 
computation. The network was not 
optimised for this mode of calculation. 


T800 floating point transputers execute 
the floating point version of the code 
faster than the integer version of the code. 
This is because the floating point code is 
much simpler than the integer version as 
it applies no range checking or scaling of 
numbers. 


4. Summary and Conclusions 


This paper has described the 
performance of a small (1 cubic ft) 
prototype node for large MIMD machines 
built from transputers. The key features 
of this machine are distributed memory and 
reconfigurable point-to-point 
communications. 

The Intervisibility calculations 
demonstrate the power and versatility of 
the machine. The different processor 
configurations were easy to generate and 
program using the switches. This is the 
great strength and flexibility of the 
machine. Without the switches the 
tendency is to set up one network and use 
that regardless of how inefficient it is. 

The processing power of the 
machine can be seen from the execution 
times quoted above. The speed of 
execution of the visibility calculations 
allows the effects of arithmetic 
precision, interpolation etc. to be quickly 
and easily investigated. It also makes 
searches for optimum sensor sites a 
realistic proposition. Such investigations 
are rarely done on conventional machines 
due to excessive CPU time requirements. 
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Observer | Calculation | Arithmetic type 


Position (x y) | type | Floating pt | Integer 

128 128 | Shadow 
0 0 | Shadow 
128 128 | Depth of shadow 
0 0 | Depth of shadow 


Table 1 Summary of execution times in seconds for 256 x 256 map 
These timings are largely limited by the speed of the graphics processor 


Relative | Calculation | Arithmetic type 
position | type | Floating pt | Integer 

Shadow 
Shadow 
Depth of shadow 
Depth of shadow 


Table 2 Summary of execution times in seconds with 512 x 512 map 
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Abstract -- This paper presents the results of the 
implementation of a Navier-Stokes algorithm on three 
parallel/vector computers. The object of this research is to 
determine how well, or poorly, a single numerical algo- 
rithm would map onto three different architectures. The 
algorithm is a compact difference scheme for the solution 
of the incompressible, two-dimensional, time dependent 
Navier-Stokes equations. The computers were chosen so 
as to encompass a variety of architectures. They are: the 
MPP, an SIMD machine with 16K bit serial processors; 
Flex/32, an MIMD machine with 20 processors; and 
Cray/2. The implementation of the algorithm is discussed 
in relation to these architectures and measures of the per- 
formance on each machine are given. Simple performance 
models are used to describe the performance. These 
models highlight the bottlenecks and limiting factors for 
this algorithm on these architectures. Finally conclusions 
are presented. 


I. Introduction 


Over the past few years a significant number of paral- 
lel computers have been built. Some of these have been 
one of a kind research engines, others are offered commer- 
cially. Both SIMD and MIMD architectures are included. 
A major problem now facing the computing community is 
to understand how to use these various machines most 
effectively. Theoretical studies of this question are valu- 
able. However, we believe that comparative studies, 
wherein the same algorithm is implemented on a number 
of different architectures, provide an equally valid way to 
this understanding. These studies, carried out for a wide 
variety of algorithms and architectures, can highlight those 
features of the architectures and algorithms which make 
them suitable for high performance parallel processing. 
They can exhibit the detailed features of an architecture 
and/or algorithm which can be bottlenecks and which may 
be overlooked in theoretical studies. The success of this 
approach depends on choosing "significant" algorithms for 
implementation and carrying out the implementation over a 
wide spectrum of architectures. If the algorithm is trivial 
or embarrassingly parallel it will fit any architecture very 
well. We need to use algorithms which solve hard prob- 
lems which are attacked in the scientific and engineering 
community. 


In this paper we present the results of the implementa- 
tion of an algorithm for the numerical solution of the 
Navier-Stokes equations, a set of nonlinear partial 
differential equations. In detail, the algorithm is a compact 
difference scheme for the numerical solution of the 
incompressible, two dimensional, time dependent Navier- 
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Stokes equations. The implementation of the algorithm 
requires the setting of initial conditions, boundary condi- 
tions at each time step, time stepping the field, and check- 
ing for convergence at each time step. Equally important 
to the choice of algorithm is the choice of parallel comput- 
ers. We have chosen to work on a set of machines which 
encompass a variety of architectures. They are: the MPP, 
an SIMD machine with 16K bit serial processors; Flex/32, 
an MIMD machine with 20 processors; and Cray/2. The 
basic comparison which we make is among SIMD instruc- 
tion parallelism on the MPP, MIMD process parallelism on 
the Flex/32, and vectorization of a serial code on the 
Cray/2. The implementation is discussed in relation to 
these architectures and measures of the performance of the 
algorithm on each machine are given. In order to under- 
stand the performances on the various machines simple 
performance models are developed to describe how this 
algorithm, and others, behave on these computers. These 
models highlight the bottlenecks and limiting factors for 
algorithms of this class on these architectures. In the last 
section of this paper we present a number of conclusions. 


II. The numerical algorithm 


The Navier-Stokes equations for the two-dimensional, 
time dependent flow of a viscous incompressible fluid may 
be written, in dimensionless variables, as: 


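$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0, \tag{2.1}$$

$$\frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} = \zeta, \tag{2.2}$$

$$\frac{\partial \zeta}{\partial t} + \frac{\partial}{\partial x}(u\zeta) + \frac{\partial}{\partial y}(v\zeta) = \frac{1}{Re}\nabla^2 \zeta, \tag{2.3}$$


where $\mathbf{U} = (u,v)$ is the velocity, $\zeta$ is the vorticity and $Re$ is 
the Reynolds number. 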


The numerical algorithm used to solve equations (2.1) 
to (2.3) was first described by Gatski, et al. [6]. This algo- 
rithm is based on the compact differencing schemes which 
require the use of only the values of the dependent vari- 
ables in and on the boundaries of a single computational 
cell. Grosch [8] adapted the Navier-Stokes code to ICL- 
DAP. Fatoohi and Grosch [3] solved equations (2.1) and 
(2.2), the Cauchy-Riemann equations, on parallel comput- 
ers. The algorithm is briefly described here. 


Consider equations (2.1) to (2.3) in the square domain 
$0 \le x \le 1$, $0 \le y \le 1$ with the boundary conditions u = 1 
and v = 0 at y = 1 and u = v = 0 elsewhere. Subdivide the 
domain into rectangular cells. The center of a cell is at 
(i+1/2, j+1/2). Apply the centered difference operator to 
equations (2.1) and (2.2), to get 

$$\delta_x u_{i+1/2,j+1/2} + \delta_y v_{i+1/2,j+1/2} = 0, \tag{2.4}$$

$$\delta_x v_{i+1/2,j+1/2} - \delta_y u_{i+1/2,j+1/2} = \zeta_{i+1/2,j+1/2}. \tag{2.5}$$

The adaptation of this algorithm to different parallel 
architectures can be simplified by the introduction of box 
variables to represent U. The box variables, P’, are defined 
at the corners of the cells so that the average of two adja- 
cent P’s is equal to the U on the included side. The set of 
difference equations and boundary conditions in terms of 
the box variables are solved using a cell relaxation scheme 
which is equivalent to an SOR method [6], [8]. 


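The compact difference approximation to equation 
(2.3) results in an implicit set of equations which are 
solved by an ADI method [4]. This method consists of 
two half steps to advance the solution one full step in time. 
Let $\Delta t$ be the full time step and apply finite difference 
operators to equation (2.3), to get 

$$\beta^{(x)}_{ij}\,\zeta^{n+1/2}_{i-1,j} - (1 + 2\alpha^{(x)})\,\zeta^{n+1/2}_{i,j} + \gamma^{(x)}_{ij}\,\zeta^{n+1/2}_{i+1,j} = F_{ij}, \tag{2.6}$$

$$\beta^{(y)}_{ij}\,\zeta^{n+1}_{i,j-1} - (1 + 2\alpha^{(y)})\,\zeta^{n+1}_{i,j} + \gamma^{(y)}_{ij}\,\zeta^{n+1}_{i,j+1} = G_{ij}, \tag{2.7}$$

where 

$$F_{ij} = -\beta^{(y)}_{ij}\,\zeta^{n}_{i,j-1} - (1 - 2\alpha^{(y)})\,\zeta^{n}_{i,j} - \gamma^{(y)}_{ij}\,\zeta^{n}_{i,j+1},$$

$$G_{ij} = -\beta^{(x)}_{ij}\,\zeta^{n+1/2}_{i-1,j} - (1 - 2\alpha^{(x)})\,\zeta^{n+1/2}_{i,j} - \gamma^{(x)}_{ij}\,\zeta^{n+1/2}_{i+1,j},$$

$$\alpha^{(x)} = \Delta t / 2(\Delta x)^2 Re, \quad \alpha^{(y)} = \Delta t / 2(\Delta y)^2 Re,$$

$$\beta^{(x)}_{ij} = \alpha^{(x)} + \Delta t\,U_{i-1,j}/4(\Delta x), \quad \beta^{(y)}_{ij} = \alpha^{(y)} + \Delta t\,V_{i,j-1}/4(\Delta y),$$

$$\gamma^{(x)}_{ij} = \alpha^{(x)} - \Delta t\,U_{i+1,j}/4(\Delta x), \quad \gamma^{(y)}_{ij} = \alpha^{(y)} - \Delta t\,V_{i,j+1}/4(\Delta y).$$


The velocity field is not defined at the corners of the cells 
in this scheme; however, it can be computed using the box 
variables at the two immediate interior neighbors along the 
vertical and horizontal lines. Equation (2.6) represents a 
set of independent tridiagonal systems (one for each 
vertical line of the domain). Similarly, equation (2.7) 
represents a set of independent tridiagonal systems (one for 
each horizontal line of the domain). The ADI method for 
equation (2.3) is applied to all interior points of the 
domain. The values of $\zeta$ on the boundaries are computed 
using equation (2.2), see [2] for details. 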


The key to the adaptation of the relaxation scheme for 
solving equations (2.1) and (2.2) to parallel computers is 
the realization that each P’ is updated four times in a 
sequential sweep over the array of cells. This fact is util- 
ized by using reordering to achieve parallelism. The com- 
putational cells are divided into four sets of disjoint cells 
so that the cells of each set can be processed in parallel 
[3]. It is therefore clear that the cell iteration for the box 
variables is a four "color" scheme. Thus four steps are 
necessary for a complete relaxation sweep. 
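
A small sketch of such a four "color" partition is given below. The exact partition used in [3] is not reproduced in this paper, so colouring a cell by the parity of its two indices is an assumption; the check confirms that cells of one color share no corner and can therefore be relaxed in parallel.

```python
def cell_color(i, j):
    """Assumed colouring: one of four colors from the parity of (i, j)."""
    return 2 * (i % 2) + (j % 2)


def corners(i, j):
    """The four grid points (box variable locations) at the corners of cell (i, j)."""
    return {(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)}


def color_sets(n_cells):
    """Split an n_cells x n_cells array of cells into the four color sets."""
    sets = {color: [] for color in range(4)}
    for i in range(n_cells):
        for j in range(n_cells):
            sets[cell_color(i, j)].append((i, j))
    return sets


def is_conflict_free(cells):
    """True if no two cells in the set touch the same corner."""
    seen = set()
    for i, j in cells:
        c = corners(i, j)
        if seen & c:
            return False
        seen |= c
    return True


sets = color_sets(7)
print({color: is_conflict_free(cells) for color, cells in sets.items()})
```

An interior corner belongs to four cells, one of each color, which matches the statement that each box variable is updated four times per sweep.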


The main issue in implementing the ADI method for 
equation (2.3) on parallel computers is choosing an 
efficient algorithm for the solution of tridiagonal systems. 
Two algorithms are considered here: Gaussian elimination 
and cyclic elimination [4], [9]. The Gaussian elimination 
algorithm is based on an LU decomposition of the 
tridiagonal matrix. This algorithm is inherently serial 
because of the recurrence relations in both stages of the 
algorithm. However, if one is faced with solving a set of 
independent tridiagonal systems, then Gaussian elimination 
will be the best algorithm to use on a parallel computer; all 
systems of the set are solved in parallel. The cyclic 
elimination algorithm is a variant of the cyclic reduction 
algorithm [9] that applies the reduction procedure to all of the 
equations and eliminates the back substitution phase of 
the algorithm. Cyclic elimination is most suitable for 
machines with a large natural parallelism, like the MPP. 
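
The two solvers can be contrasted with a short serial sketch. The code below is not the authors' MPP Pascal or Fortran; it assumes a system written as a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with a[0] = c[n-1] = 0, and the function names are hypothetical. In `cyclic_elimination` the inner loop over i has no dependence between iterations, so on an array machine all equations could be updated at once, and neighbours that fall outside the system contribute zeros.

```python
def thomas(a, b, c, d):
    """Gaussian elimination (LU) for one tridiagonal system; inherently serial."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                   # forward recurrence
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):          # back substitution recurrence
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def cyclic_elimination(a, b, c, d):
    """Reduction applied to every equation at each level; no back substitution."""
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:                       # about log2(n) levels
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):                  # independent updates, parallel in principle
            lo, hi = i - stride, i + stride
            if lo >= 0:                     # eliminate x[lo] using equation lo
                f = -a[i] / b[lo]
                na[i] = f * a[lo]
                nb[i] += f * c[lo]
                nd[i] += f * d[lo]
            else:
                na[i] = 0.0
            if hi < n:                      # eliminate x[hi] using equation hi
                g = -c[i] / b[hi]
                nc[i] = g * c[hi]
                nb[i] += g * a[hi]
                nd[i] += g * d[hi]
            else:
                nc[i] = 0.0
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]  # every equation is now decoupled


n = 8
a = [0.0] + [1.0] * (n - 1)
b = [4.0] * n
c = [1.0] * (n - 1) + [0.0]
d = [float(i + 1) for i in range(n)]
print(thomas(a, b, c, d))
print(cyclic_elimination(a, b, c, d))
```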


The solution procedure for the Navier-Stokes equations 
can be summarized as follows: 


(1) Assume that $\zeta$ is zero everywhere at t = 0. The 
variables and boundary values are initialized. 


(2) The vorticity at the corners of the domain, undefined 
in this scheme, is approximated using the values of its 
neighboring points. The values of $\zeta_{i+1/2,j+1/2}$ are 
computed using the values of $\zeta$ at the corners of the cells. 


(3) The relaxation process is implemented for each 
"color", i.e. four times in order to complete a sweep. 
The maximum residual is computed and tested against 
the convergence tolerance. The whole process is 
repeated until the iteration converges. 


(4) The coefficients $\alpha^{(x)}$, $\alpha^{(y)}$, $\beta^{(x)}$, $\beta^{(y)}$, 
$\gamma^{(x)}$, $\gamma^{(y)}$ for both passes of the ADI method are computed. 


(5) The values of $\zeta$ on the boundaries are computed. 


(6) The tridiagonal equations distributed over columns, 
equation (2.6), are solved. 


(7) The tridiagonal equations distributed over rows, 
equation (2.7), are solved. 


These steps were implemented using the following 
subprograms: setbc, step (1); zcntr, step (2); relaxd, step (3); cof, 
step (4); zbc, step (5); triied, step (6); and trijed, step (7). 
The repetition of steps (2) through (7) yields the values of 
the velocity and vorticity at any later time. 

III. Implementation on the MPP 


The Massively Parallel Processor (MPP) is a large- 
scale SIMD processor built by Goodyear Aerospace Co. for 
NASA Goddard Space Flight Center [1]. The MPP is a 
back-end processor for a VAX-11/780 host, which supports 
its program development and I/O needs. 


The MPP’s high level language is MPP Pascal [7]. It 
is a machine-dependent language which has evolved from 
Parallel Pascal, an extended version of Pascal with a syntax 
for specifying array operations. These extensions provide a 
parallel array data type and operations on these arrays. 


The Navier-Stokes algorithm, described in section II, 
was implemented on the MPP using 127 x 127 cells 
(128 x 128 grid points). The computational cells are 
mapped onto the array so that each corner of a cell 
corresponds to a processor. The seven subprograms of this 
algorithm (see section II) were written in MPP Pascal. 
These subprograms were executed entirely on the MPP; 
only I/O routines were run on the VAX. 


The relaxation process, subprogram relaxd, was 
implemented on the array using the four color relaxation 
scheme [3]. The ADI method, subprograms triied and 
trijed, was implemented by solving two sets of 128 tridiag- 
onal systems using the cyclic elimination algorithm. This 
is done in parallel on the array with a tridiagonal system of 
128 equations being solved on each row or column. 


One of the problems in solving Navier-Stokes equa- 
tions on the MPP is the size of the PE memory. The 
relaxation subprogram uses almost all of the 1024 bit PE 
memory; 22 parallel arrays of floating point numbers, all 
but 5 of which are temporary. Although the staging 
memory can be used as a backup memory, this causes an 
I/O overhead and reduces the efficiency. This problem was 
solved by declaring all parallel arrays as global variables 
and using them in procedures for more than one purpose. 
Besides this memory problem, there are problems in using 
MPP Pascal to perform vector operations and to extract 
elements of parallel arrays. Operations on vectors are per- 
formed by expanding them to matrices and performing 
matrix operations; thus the processing rate is 1/128 of that 
for matrix operations. MPP Pascal does not permit extract- 
ing an element of a parallel array. This means that scalar 
operations involving elements of parallel arrays need to be 
expanded to matrix operations or performed on the VAX. 

The relaxation subprogram is quite efficient; almost 
all of the operations are matrix operations, no vector and 
only two scalar operations per iteration, with data transfers 
only between nearest neighbors. The ADI subprograms are 
reasonably efficient; mostly matrix operations with few 
scalar and no vector operations. However, both algorithms 
have some hidden defects. In updating the box variables 
for each set in the relaxation scheme only one fourth of the 
processors do useful work; the remaining processors are 
masked out. This is because only one corner of each cell 
of a set is updated each time. For each level of the elimi- 
nation process in the cyclic elimination algorithm, a set of 
data is shifted off the array and an equal set of zeros is 
shifted onto the array. This means that some of the pro- 
cessors are not doing useful work; here they are either 
multiplying by zero or adding a zero. This is a problem 
with many algorithms on SIMD machines. 


Table I contains the execution time for each subpro- 
gram of the algorithm, that for one iteration in the case of 
relaxd; the percentage of the total time spent in that sub- 
program; and the processing rate. It is clear, from Table I, 
that the majority of the time was spent in relaxd for this 
particular run. This is because the average time step 
requires about 270 iterations and the total time spent in the 
other subprograms ( zcntr, cof, zbc, triied, trijed ) is only 
about the time to do two iterations of relaxd. The number 
of iterations in relaxd per time step depends on the data 
used during a given run. A different input data set could 
result in a smaller number of iterations per time step and 
relatively less time spent in the relaxation subprogram. 
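
As a rough check of this statement, assume about 270 relaxd iterations per time step, each taking a time $t_{relaxd}$, and take the other five subprograms together to cost exactly two relaxd iterations. Both figures are the approximate ones quoted above, not measured values. The share of a time step spent in relaxd is then

$$\frac{270\,t_{relaxd}}{270\,t_{relaxd} + 2\,t_{relaxd}} = \frac{270}{272} \approx 0.993,$$

that is, a little over 99% of the time for this particular run.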
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Table I. Measured execution time and processing rate of 
the Navier-Stokes subprograms for the 128 x 128 problem 
on the MPP. 


Subprogram | Execution time (msec) | Percent of total time | Processing rate (MFLOPS) 

overall # | ~a1.s97 | 100.00 | 155 


* per iteration. 


# for ten time steps (execution time is in seconds here). 


The processing rates in Table I are determined by 
counting only the arithmetic operations which truly 
contribute to the solution. Scalar and vector operations which 
were implemented as matrix operations are counted as 
scalar and vector operations. This is the reason why the 
subprograms zbc and zcntr have low processing rates; zbc 
has only vector operations while zcntr has some scalar 
operations implemented as matrix operations. The 
subprogram setbc has mostly scalar and data assignment 
operations which reduce its processing rate. Besides these three 
subprograms, the processing rate ranges from 125 to 155 
MFLOPS with an average rate of about 140 MFLOPS. 


In order to estimate the execution time of an algo- 
rithm on the MPP, the numbers of arithmetic and data 
transfer operations are counted and the cost of each opera- 
tion is measured. This is illustrated in the following 
model. Only operations on parallel arrays are considered. 


The execution time of an algorithm on the MPP, $T$, 
can be modeled as: 

$$T = T_{cmp} + T_{cmm}, \tag{3.1}$$

$$T_{cmp} = t_c\,(N_a C_a + N_m C_m + N_d C_d), \tag{3.2}$$

$$T_{cmm} = t_c\,(N_{sh} C_{sh} + N_{st} C_{st}), \tag{3.3}$$

where $T_{cmp}$ and $T_{cmm}$ are the computation and communication 
times; $t_c$ is the machine cycle time ($t_c$ = 100 nsec); $N_a$, 
$N_m$, $N_d$, $N_{sh}$ and $N_{st}$ are the numbers of additions, 
multiplications, divisions, shift operations, and steps shifted; and 
$C_a$, $C_m$, $C_d$, $C_{sh}$ and $C_{st}$ are the numbers of cycles for 
addition, multiplication, division, startup shift operation, 
and each step of shift operation. Table II contains the 
measured values of the basic floating point operations. 
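
Equations (3.1) to (3.3) translate directly into a small function. The sketch below is only an illustration of the model: the function name `mpp_time` is hypothetical, and the cycle counts and operation counts in the usage lines are placeholders, not the measured values of Table II or the counts of Table III.

```python
def mpp_time(n_add, n_mul, n_div, n_shift, n_steps,
             c_add, c_mul, c_div, c_shift, c_step,
             t_cycle=100e-9):
    """Return (T, T_cmp, T_cmm) in seconds for the model of (3.1) to (3.3).

    n_* are operation counts, c_* are costs in machine cycles and t_cycle
    is the machine cycle time (100 nsec on the MPP).
    """
    t_cmp = t_cycle * (n_add * c_add + n_mul * c_mul + n_div * c_div)   # (3.2)
    t_cmm = t_cycle * (n_shift * c_shift + n_steps * c_step)            # (3.3)
    return t_cmp + t_cmm, t_cmp, t_cmm                                  # (3.1)


# Placeholder inputs chosen only to exercise the function.
total, t_cmp, t_cmm = mpp_time(n_add=10, n_mul=8, n_div=1, n_shift=4, n_steps=6,
                               c_add=100, c_mul=200, c_div=300,
                               c_shift=20, c_step=5)
print(total, t_cmp, t_cmm, t_cmm / total)
```

The last printed value is the fraction of the estimated time spent on data transfers, the quantity discussed for Table IV.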


Table II. Measured execution times (in machine cycles) of 
the floating point operations in MPP Pascal. 


Add | Multiply | Divide | One step shift 


Table III contains the operation counts per grid point 
for the Navier-Stokes subprograms on the MPP using the 
cyclic elimination algorithm for solving the tridiagonal 
systems. Note that scalar and vector operations (in zcntr and 
zbc), which were implemented as matrix operations, are 
considered here as matrix operations. Table IV contains 
the estimated computation and communication times using 
equations (3.2) and (3.3) and Tables II and III. The cost of 
scalar operations is not included in this model; this 
explains the differences between the estimated and 
measured times for setbc and cof. Apart from these two 
subprograms, the difference between the total estimated and 
measured times ranges from 3% to 8% of the measured 
times. The amount of time spent on data transfers is quite 
modest; from 6% for relaxd to 25% for triied and trijed. 
This is because this algorithm does not contain many data 
transfers and these transfers are only between nearest 
neighbors except for the tridiagonal solvers. 


Table III. Operation counts per grid point for the Navier- 
Stokes subprograms on the MPP, using the cyclic elimina- 
tion algorithm for solving the tridiagonal systems. 


Steps 
shifted 


* per iteration. 


Table IV. Estimated execution times (in milliseconds) of 
the Navier-Stokes subprograms on the MPP. 


Total est. time | Measured time


IV. Implementation on the Flex/32 
The Flex/32 is an MIMD shared memory multiprocessor
based on the 32-bit National Semiconductor 32032
microprocessor and 32081 coprocessor [5]. The results
presented here were obtained using the 20 processor
machine at NASA Langley Research Center.


The machine has ten local buses; each connects two 
processors. These local buses are connected together and 
to the common memory by a common bus. The 2.25 
Mbytes of the common memory is accessible to all proces- 
sors. Each processor contains 4 Mbytes of local memory. 
Each processor has a cycle time of 100 nsec. 

The Navier-Stokes algorithm, described in section II,
was implemented on the Flex/32 using 64 x 64 grid points 
(63 x 63 cells) and 128 x 128 grid points (127 x 127 
cells). The main program as well as the seven subpro- 
grams of the algorithm were written in Concurrent Fortran, 
which comprises the standard Fortran 77 language and 
extensions that support concurrent processing. 


The parallel implementation of the Navier-Stokes 
algorithm is done by assigning a strip of the computational 
domain to a process and performing all the steps of the 
algorithm in each process. The main program performs 
only the input and output operations and creates and 
spawns the processes on specified processors. In our 
implementation, we used 1, 2, 4, 8, and 16 processors of 
the machine. The domain is decomposed first vertically for 
the first six subprograms ( setbc, zcntr, relaxd, cof, zbc, 
and triied ) and then horizontally for the subprogram trijed. 
The relaxation scheme for each strip was implemented 
locally. After relaxing each set of cells, each process 
exchanges the values of the interface points with its two 
neighbors through the common memory. The tridiagonal 
equations were solved using the Gaussian elimination algo- 
rithm for both passes of the ADI method. Data is stored in 
the common memory, in the local memory of each proces- 
sor, or in both of them. 
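For reference, Gaussian elimination on a single tridiagonal system (often called the Thomas algorithm) can be sketched as follows. This is a generic textbook form in Python, not the authors' Concurrent Fortran code; the function name and argument layout are assumptions, and no pivoting is done, so the matrix is assumed to be diagonally dominant.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by Gaussian elimination.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    Returns the solution as a new list; the inputs are not modified.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination: remove the sub-diagonal row by row.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# A 3 x 3 example whose exact solution is [1, 2, 3].
x = thomas(a=[0.0, -1.0, -1.0], b=[2.0, 2.0, 2.0],
           c=[-1.0, -1.0, 0.0], d=[0.0, 0.0, 4.0])
```

The recurrence is serial along one system, which is why the parallelism in this implementation comes from solving many independent systems at once, one strip of systems per process.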


In order to satisfy data dependencies between seg- 
ments of the code, a counter is used. This counter, which 
is a shared variable with a lock assigned to it, can be incre- 
mented by any process and be reset by only one process. 
It is implemented as a "barrier" where all processes pause
when they reach it. A set of flags is also used for
synchronization in the relaxation subprogram.
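A counter-based barrier of this kind can be sketched with Python threads standing in for the Flex/32 processes. The class and function names are hypothetical, and two details are assumptions rather than the authors' design: the last process to arrive resets the counter (the text only says that a single process resets it), and a generation number is added so the barrier can be reused safely.

```python
import threading
import time


class CounterBarrier:
    """Barrier built from a shared counter protected by a lock."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.count = 0          # number of workers that have arrived
        self.generation = 0     # incremented each time the barrier opens
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            my_generation = self.generation
            self.count += 1
            if self.count == self.n_workers:
                # Last arrival resets the counter and releases the others.
                self.count = 0
                self.generation += 1
                return
        # Everyone else polls until the generation changes.
        while True:
            with self.lock:
                if self.generation != my_generation:
                    return
            time.sleep(0)  # yield to other threads


def run_phases(n_workers=4, n_phases=3):
    """Run workers through several phases; return the order of phase entries."""
    barrier = CounterBarrier(n_workers)
    log, log_lock = [], threading.Lock()

    def worker():
        for phase in range(n_phases):
            with log_lock:
                log.append(phase)   # "work" for this phase
            barrier.wait()          # nobody starts phase + 1 early

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier works, every entry for one phase appears in the log before any entry for the next phase.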

Table V contains the speedups and efficiencies as 
functions of the number of processors for the 64 x 64 and 
128 x 128 problems. The measured execution times and
processing rates using 16 processors are listed in Table VI. 
The efficiency of the algorithm ranges from about 94%, for 
the 64 x 64 problem using 16 processors, to about 99%, 
for the 128 x 128 problem using two processors. 


Table V. Speedup and efficiency as functions of the 
number of processors, p, of the Navier-Stokes algorithm on 
the Flex/32. 


64 x 64 points 128 x 128 points 


speedup efficiency speedup efficiency


Table VI. Measured execution times for ten time steps and 
processing rates for the Navier-Stokes algorithm using 16 
processors of the Flex/32. 


Problem size | Execution time (sec) | Processing rate (MFLOPS)


The performance model is based on estimating the
values of the overheads resulting from running the
algorithm on more than one processor. The execution time of
an algorithm on p processors of the Flex/32, $T_p$, can be
modeled as:

$$T_p = T_{cmp} + T_{ovr}, \qquad (4.1)$$

where $T_{cmp}$ is the computation time and $T_{ovr}$ is the
overhead time. Let $f_{ld}$ be a load distribution factor where
$f_{ld} = 1$ if the load is distributed evenly between the
processors and $f_{ld} > 1$ if at least one processor has less work to
do than the other processors. Then the computation time
on p processors can be computed by

$$T_{cmp} = f_{ld}\, T_1 / p, \qquad (4.2)$$

where $T_1$ is the computation time using a single processor.
The overhead time can be modeled by:

$$T_{ovr} = T_{cmo} + T_{spn} + T_{syn}, \qquad (4.3)$$

where $T_{cmo}$ is the common memory overhead time, $T_{spn}$ is
the spawning time of p processes, and $T_{syn}$ is the
synchronization time. Three components of the common memory
overhead time can be identified:

$$T_{cmo} = T_{cma} + T_{cpl} + T_{cml}, \qquad (4.4)$$

where $T_{cma}$ is the common memory additional time - this
results from storing additional variables in the common
memory; $T_{cpl}$ is the common plus local memory time - this
results from storing variables in both the common and
local memories; $T_{cml}$ is the common minus local memory
time - this results from storing variables in the common
memory instead of local memory. The values of $T_{spn}$, $T_{syn}$,
$T_{cma}$, $T_{cpl}$, and $T_{cml}$ can be estimated as follows:

$$T_{spn} = p\, t_{spn}, \qquad (4.5)$$

$$T_{syn} = p\, k_{lck}\, t_{lck}, \qquad (4.6)$$

$$T_{cma} = n\, k_{cma}\, f_{bc}(p)\, t_{cma}, \qquad (4.7)$$

$$T_{cpl} = n\, k_{cpl}\, (f_{bc}(p)\, t_{cma} + t_{lma}), \qquad (4.8)$$

$$T_{cml} = n\, k_{cml}\, (f_{bc}(p)\, t_{cma} - t_{lma}), \qquad (4.9)$$

where $t_{spn}$ is the time to spawn one process - a reasonable
value is 13 msec; $t_{lck}$ is the time to lock and unlock a
variable - a reasonable value is 47 μsec; $t_{cma}$ is the time to
access a variable in common memory - a reasonable value
is 6 μsec; $t_{lma}$ is the time to access a variable in local
memory - a reasonable value is 5 μsec; $k_{lck}$ is the number
of times a variable is locked and unlocked for each
process; $k_{cma}$ is the number of times an additional variable is
referenced; $k_{cpl}$ is the number of times a variable is stored
in both local and common memory; $k_{cml}$ is the number of
times a variable is stored in common memory instead of
local memory; and $f_{bc}(p)$ is the bus contention factor - it is
a function of p. It is assumed that all memory operations
are performed on vectors of length n.
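The overhead model can be turned into a short calculation. The sketch below is a hedged reading of equations (4.1) to (4.9): the function names are hypothetical, the bus contention factor is passed in as a plain number (its dependence on p is not given in the text, so the default of 1.0 is an assumption), and the single-processor time in the example is a placeholder, not a measured value.

```python
# Per-operation costs quoted in the text, in seconds.
T_SPN = 13e-3   # spawn one process
T_LCK = 47e-6   # lock and unlock a variable
T_CMA = 6e-6    # access a variable in common memory
T_LMA = 5e-6    # access a variable in local memory


def overhead_time(p, n, k_lck, k_cma, k_cpl=0, k_cml=0, f_bc=1.0):
    """Overhead time T_ovr in seconds, following (4.3) to (4.9)."""
    t_spn = p * T_SPN                             # (4.5) spawning
    t_syn = p * k_lck * T_LCK                     # (4.6) synchronization
    t_cma = n * k_cma * f_bc * T_CMA              # (4.7) additional variables
    t_cpl = n * k_cpl * (f_bc * T_CMA + T_LMA)    # (4.8) common plus local
    t_cml = n * k_cml * (f_bc * T_CMA - T_LMA)    # (4.9) common minus local
    return t_spn + t_syn + (t_cma + t_cpl + t_cml)


def execution_time(t_1, p, f_ld, t_ovr):
    """Execution time T_p on p processors, following (4.1) and (4.2)."""
    return f_ld * t_1 / p + t_ovr


# Example: 16 processes, vectors of length 64, one lock and eight additional
# common-memory references; t_1 = 100 s is a placeholder.
t_ovr = overhead_time(p=16, n=64, k_lck=1, k_cma=8)
t_16 = execution_time(t_1=100.0, p=16, f_ld=1.0, t_ovr=t_ovr)
```

With these inputs the spawning term (0.208 s) is by far the largest single contribution, but it is paid only once per run, whereas the memory terms recur at every relaxation iteration.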


The performance of the Navier-Stokes algorithm is
heavily influenced by the performance of the relaxation
subprogram; about 98% of the total time was spent in this
subprogram. Since the number of cells is not divisible by
the number of processors used, the last processor has less
work to do than the other processors. Therefore, the load
distribution factor in equation (4.2) can be computed by

$$f_{ld} = \left\lceil \frac{n-1}{p} \right\rceil \frac{p}{n-1}. \qquad (4.10)$$

Using the performance model, equations (4.1) through
(4.10), the overhead time represents at most 5% of the
execution time of the algorithm, including the load distribution
factor. The overhead time of the relaxation subprogram
dominates the total overhead time. The values of $k_{lck}$ and
$k_{cma}$ for each iteration of the relaxation process are 1 and
8. The spawning time has a minor impact on the overhead
time because the processes are spawned only once during
the lifetime of the program. The synchronization time is
insignificant because the routines that provide the locking
mechanism are very efficient and overlap with the memory
access. The bus contention factor is very small. The
common memory additional time, $T_{cma}$, dominates the overhead
time. This overhead results from accessing the interface
points for each iteration of the relaxation subprogram. The
other components of the common memory overhead time,
$T_{cpl}$ and $T_{cml}$, have a negligible impact on the total
overhead time because these operations are performed only
once during every time step.
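The effect of uneven strips can be illustrated numerically. The sketch below assumes one reading of the load distribution factor, namely the largest strip, ceil((n-1)/p) columns of cells, divided by the ideal share (n-1)/p; the function name is hypothetical and the formula is an interpretation of equation (4.10), not a quotation of it.

```python
def load_factor(n, p):
    """Load distribution factor for n - 1 cell columns split into p strips.

    Assumes the busiest process gets ceil((n - 1) / p) columns, so the
    factor is that strip width divided by the ideal width (n - 1) / p.
    """
    cells = n - 1
    widest = -(-cells // p)  # integer ceiling of cells / p
    return widest * p / cells


for n in (64, 128):
    for p in (1, 2, 4, 8, 16):
        f_ld = load_factor(n, p)
        # 1 / f_ld bounds the efficiency when all other overheads are ignored.
        print(f"n={n:3d} p={p:2d} f_ld={f_ld:.4f} max efficiency={1 / f_ld:.4f}")
```

Under this reading the imbalance alone costs less than 2% for both grids, so most of the gap to 100% efficiency has to come from the overhead terms.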


V. Implementation on the Cray/2 


The Cray/2 is an MIMD supercomputer with four 
Central Processing Units, a foreground processor which 
controls I/O and a main memory. The results reported here 
were obtained using the old Cray/2 at NASA Ames 
Research Center; the new one has a shorter main memory 
access time than the old one. 


The Navier-Stokes algorithm, described in section II, 
was implemented on one processor of the Cray/2 using 
64 x 64 and 128 x 128 grid points. The reordered form of 
the relaxation scheme, the four color scheme, was imple- 
mented on the Cray/2 with no major modifications. The 
reordering process removes any recursion because each of 
the four sets (colors) contains disjoint cells. The two sets 
of the tridiagonal systems were solved by the Gaussian 
elimination algorithm for all systems of each set in parallel. 
This was done by changing all variables of the algorithm 
into vectors running across the tridiagonal systems. The 
inner loops of all of the seven subprograms of the Navier- 
Stokes algorithm were fully vectorized. The local memory 
was used to store some of the variables, whenever that was 
possible. This reduces main memory conflicts and speeds 
up the calculation. 


Tables VII and VIII contain the execution time for 
each subprogram, the percentage of the total time spent in 
that subprogram, and the processing rate for the 64 x 64 
and 128 x 128 problems. Most of the time was spent in 
relaxd, and the average time step requires about 110 itera- 
tions for the 64 x 64 problem and about 270 iterations for 
the 128 x 128 problem. The subprogram setbc has a low
processing rate because it consists mostly of memory access and
scalar operations; however, this subprogram is called only
once during the lifetime of the program. Apart from this
subprogram, the processing rate ranges from 57 to 97
MFLOPS with an average rate of about 70 MFLOPS for
the subprograms of both problems.


Table VII. Measured execution time and processing rate of 
the Navier-Stokes subprograms for the 64 x 64 problem on 
one processor of the Cray/2. 


Execution 
time (msec) 


Processing 
rate (MFLOPS) 


triied
trijed
overall | 3.048 | 100.00 | 96


* per iteration 
# for ten time steps (execution time is in seconds here). 


Table VIII. Measured execution time and processing rate 
of the Navier-Stokes subprograms for the 128 x 128 prob- 
lem on one processor of the Cray/2. 


Execution 
time (msec) 


Processing 
rate (MFLOPS) 


* per iteration


# for ten time steps (execution time is in seconds here). 


Based on the fact that Cray vector operations are
"stripmined" in sections of 64 elements, the time required
to perform arithmetic and memory access operations on
vectors of length $L_{vcr}$ can be modeled as follows:

$$T_{f1} = \left( \left\lceil \frac{L_{vcr}}{64} \right\rceil L_f + L_{vcr} \right) N_{f1}\, CP, \qquad (5.1)$$

$$T_{f2} = \left( \left\lceil \frac{L_{vcr}}{128} \right\rceil L_f + \frac{L_{vcr}}{2} \right) N_{f2}\, CP, \qquad (5.2)$$

$$T_{m1} = \left( \left\lceil \frac{L_{vcr}}{64} \right\rceil L_m + R_1 L_{vcr} \right) N_{m1}\, CP, \qquad (5.3)$$

$$T_{m2} = \left( \left\lceil \frac{L_{vcr}}{128} \right\rceil L_m + R_2 \frac{L_{vcr}}{2} \right) N_{m2}\, CP, \qquad (5.4)$$

where $T_{f1}$ and $T_{f2}$ are the times to perform floating point
operations with strides of 1 and 2; $T_{m1}$ and $T_{m2}$ are the
times to perform main memory access operations with
strides of 1 and 2; CP is the clock period (CP = 4.1 nsec);
$L_m$ is the length of the main memory to registers path
($L_m$ = 56 CPs); $L_f$ is the length of the floating point functional
unit ($L_f$ = 23 CPs); $R_1$ and $R_2$ are the data transfer rates
through main memory with strides of 1 and 2 (reasonable
values are $R_1$ = 1 and $R_2$ = 3.5, although competition from
other processors causes lower transfer rates and hence
increased values of $R_1$ and $R_2$); $N_{f1}$ and $N_{f2}$ are the
numbers of floating point operations with strides of 1 and
2; and $N_{m1}$ and $N_{m2}$ are the numbers of main memory
access operations with strides of 1 and 2.
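The stride-1 part of this model is easy to check numerically. The sketch below is a hedged Python illustration covering only the stride-1 floating point and memory access times; the function names are hypothetical, and the rounding up to whole 64-element sections is an assumption about how partial sections are charged. For a 128 x 128 array it reproduces the 91.3 microsecond figure quoted in the concluding section.

```python
import math

CP = 4.1e-9    # clock period in seconds
L_F = 23       # length of the floating point functional unit, in clock periods
L_M = 56       # length of the main memory to registers path, in clock periods
SECTION = 64   # vector operations are stripmined in sections of 64 elements


def float_time_stride1(l_vcr, n_ops=1):
    """Time for n_ops stride-1 floating point operations on l_vcr elements."""
    sections = math.ceil(l_vcr / SECTION)
    # Each section pays the functional unit startup, each element one clock.
    return (sections * L_F + l_vcr) * n_ops * CP


def memory_time_stride1(l_vcr, n_ops=1, r_1=1.0):
    """Time for n_ops stride-1 memory access operations on l_vcr elements."""
    sections = math.ceil(l_vcr / SECTION)
    # Each section pays the memory path startup, each element r_1 clocks.
    return (sections * L_M + r_1 * l_vcr) * n_ops * CP


t_f = float_time_stride1(128 * 128)   # about 91.3 microseconds
t_m = memory_time_stride1(128 * 128)  # about 126 microseconds
```

Even at the best transfer rate, one memory access pass over the array costs more than one arithmetic pass, which is consistent with the observation that the machine is memory bandwidth bound for this algorithm.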


Table IX contains the operation counts per grid point 
for the Navier-Stokes subprograms using the Gaussian 
elimination algorithm for solving the tridiagonal systems. 
These operations are performed on all grid points of the 
domain except for zbc where they are performed on vec- 
tors. Tables X and XI contain the estimated times of the 
Navier-Stokes subprograms for the 64 x 64 and 128 x 128 
problems. These times are obtained using equations (5.1) 
to (5.4) and Table IX. It is assumed that each division 
takes four times the multiplication time. The main 
memory access time for each subprogram represents about 
50% to 70% of the total estimated and measured time. 
This shows that the Cray/2 is a memory bandwidth bound 
machine. The memory stride of 2 in relaxd causes more 
than a 50% slowdown in data transfer rate. The difference 
between the total estimated and measured values can be 
attributed to several causes. Among these are: the memory 
access and arithmetic operations can overlap, the time to 
perform scalar operations is not included, and there is up to 
20% offset on the results depending on the memory traffic 
and the number of the active processes. Finally, this 
model does not take into account the overlapping between 
segments of long vectors for the same operation. How- 
ever, it was found that this overlapping is insignificant for 
Fortran programs. 


Table IX. Operation counts per grid point for the Navier- 
Stokes subprograms on the Cray/2, using the Gaussian 
elimination algorithm for solving the tridiagonal systems. 


Memory 
access 


triied


* per iteration # vector operations. 


Table X. Estimated and measured execution times (in mil- 
liseconds) of the Navier-Stokes subprograms for the 
64 x 64 problem on one processor of the Cray/2. 


Measured 
time 


Table XI. Estimated and measured execution times (in mil- 
liseconds) of the Navier-Stokes subprograms for the 
128 x 128 problem on one processor of the Cray/2. 


Measured 
time 


triied 
trijed 


VI. Comparisons and Concluding Remarks 


There are a number of measures that one can use to 
compare the performance of these parallel computers using 
a particular algorithm. One is the processing rate and 
another is the execution time (see Tables I, VI, VII and 
VIII). However, it must be borne in mind that both of
these measures depend on the architectures of the comput- 
ers, the overhead required to adapt the algorithm to the 
architecture, and the technology, that is, the intrinsic pro- 
cessing power of each of the computers. 


If we consider a single problem, a ten time step run 
of the algorithm on a 128 x 128 grid, then the processing 
rate is a maximum for the MPP, 155 MFLOPS, compared 
to 97 MFLOPS for the Cray/2, and only 1.13 MFLOPS on 
16 processors of the Flex/32. The low processing rate of 
the algorithm on the 16 processors of the Flex/32 is simply 
due to the fact that the National Semiconductor 32032 
microprocessor and 32081 coprocessor are not very power- 
ful. Although the algorithm has a higher performance rate 
on the MPP than on the Cray/2, it takes less time to solve 
the problem on the Cray/2 than on the MPP. This is due 
to the algorithm overhead involved in adapting the algo- 
rithm to the MPP. As shown in Tables III and IX, each 
iteration of the relaxation process has 145 arithmetic opera- 
tions per grid point on the MPP compared to 66 operations 
per grid point on the Cray/2. Also, the cyclic elimination 
algorithm, used on the MPP, has 92 arithmetic operations 
per grid point while the Gaussian elimination algorithm, 
used on the Cray/2, has only 10 operations per grid point,
not including computation of the forcing terms.

The implementation of the algorithm on the Flex/32
has the same number of arithmetic operations per grid 
point as on the Cray/2; there is only a reordering of the 
calculations and no additional arithmetic operations in the 
overhead. The algorithmic overhead for the Flex/32 ver- 
sion is the cost of exchanging the values of the interface 
points and setting the synchronization counters for the 
relaxation scheme and accessing the common memory for 
the ADI method. This means that the code on each pro- 
cessor is the serial code plus the overhead code. When the 
code is run on one processor, it is just the serial code with 
the overhead portion removed. 


Another measure of performance is the number of
machine cycles required to solve a problem. This measure
reduces the impact of technology on the performance of
the machine. For the 128 x 128 problem, for example, the
ten time step run requires about 416 million cycles on the
MPP, 7387 million cycles on the Cray/2, and 25871 million
cycles on 16 processors of the Flex/32. This means that
the MPP outperformed the Cray/2 by a factor of 18, and
the latter outperformed the Flex/32 by a factor of 3.5, in
this measure. This also means that one processor of the
Cray/2 outperformed 16 processors of the Flex/32 even if
we assume that both machines have the same clock cycle.
The problem with the Flex/32 is that, although each
processor has a cycle time of 100 nsec, the memories (local
and common) have access times of about 1 μsec.
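The cycle-count comparison can be reproduced with a few lines of arithmetic. In the sketch below the counts are read as millions of cycles, which is an assumption made here because it is the unit that agrees with the cycle times and processing rates quoted elsewhere in the paper; the dictionary and function names are hypothetical.

```python
# Cycle counts for the ten time step, 128 x 128 run (read as millions of cycles).
CYCLES_MILLIONS = {"MPP": 416, "Cray/2": 7387, "Flex/32": 25871}
# Cycle times in nanoseconds, as quoted in the text.
CYCLE_TIME_NS = {"MPP": 100.0, "Cray/2": 4.1, "Flex/32": 100.0}


def cycle_ratio(slower, faster):
    """How many times more cycles `slower` needs than `faster`."""
    return CYCLES_MILLIONS[slower] / CYCLES_MILLIONS[faster]


def run_time_seconds(machine):
    """Wall time implied by the cycle count and the cycle time."""
    # millions of cycles * nanoseconds per cycle = 1e6 * 1e-9 s = 1e-3 s
    return CYCLES_MILLIONS[machine] * CYCLE_TIME_NS[machine] * 1e-3


print(f"Cray/2 vs MPP:     {cycle_ratio('Cray/2', 'MPP'):.1f}")
print(f"Flex/32 vs Cray/2: {cycle_ratio('Flex/32', 'Cray/2'):.1f}")
for name in CYCLES_MILLIONS:
    print(f"{name}: {run_time_seconds(name):.1f} s")
```

Under this reading the Cray/2 needs roughly 30 seconds and the MPP roughly 42 seconds, which matches the statement that the Cray/2 solves the problem faster despite its lower processing rate.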


One simple comparison between the MPP and Cray/2
is the time to perform a single arithmetic operation using
the models developed in sections III and V. Using
equation (5.1), the time to perform a single floating point
operation (addition or multiplication) on an array of size
128 x 128 elements on the Cray/2, excluding the memory
access cost, is 91.3 μsec. The time to perform the same
operation on the MPP using MPP Pascal, see Table II,
ranges from 81.1 μsec (for multiplication) to 96.5 μsec (for
addition). This shows that the processing power of a
single functional unit of the Cray/2 is comparable to the
processing power of the 16384 processors of the MPP.
However, much of the overhead is not included in this
comparison: memory access cost on the Cray/2, data transfers
on the MPP, and so on.


This experiment showed that by reordering the com- 
putations we were able to implement the relaxation scheme 
on three different architectures with no major 
modifications. Two different algorithms, Gaussian elimina- 
tion and cyclic elimination, were used to solve the tridiago- 
nal equations on the three architectures; the two algorithms 
were chosen to exploit the parallelism available on these 
architectures. The algorithm exploits multiple granularities 
of parallelism. The algorithm vectorized quite well on the 
Cray/2. Fine grained parallelism, involving sets of single
arithmetic operations executed in parallel, is obtained on
the MPP. Parallelism at a higher level (large grained) is
exploited on the Flex/32 by executing several program
units in parallel.


The performance model on the MPP was fairly
accurate in predicting the execution times of the algorithm.
The performance model on the Flex/32 showed the impact
of various overheads on the performance of the algorithm.
The performance model on the Cray/2 was based on
predicting the execution costs of separate operations. This
model was used to identify the major costs of the algorithm
and reproduced the measured results with an error of at
most 35%.


The ease or difficulty of using a machine is always a
matter of interest. The Cray/2 is relatively easy to use as a
vector machine. Existing codes that were written for serial
machines can always run on vector machines. Vectorizing
the unvectorized inner loops will improve the performance
of the code. Unlike parallel machines, vector machines do
not have the problem of "either you get it or not". The
Flex/32 is not hard to use, except for the unavailability of
debugging tools, which is a problem for many MIMD
machines (a synchronization problem could cause a
program to die). On the other hand, the MPP is not a
user-friendly system. The size of the PE memory is almost
always an issue. MPP Pascal does not permit vector
operations on the array, nor does it allow extraction of an
element of a parallel array. The MCU has 64 Kbytes of
program memory. This memory can take up to about 1500
lines of MPP Pascal code. This means that larger codes
cannot run on the MPP. Finally, input/output is somewhat
clumsy on the MPP. However, other machines with
architectures similar to the MPP may not have the same
problems that the MPP does.


There is one further observation of interest. This 
algorithm can be implemented concurrently on four proces- 
sors of the Cray/2 (multitasking). The code will be similar 
to the Flex/32 version except that most of the variables 
should be stored in the main memory. Adapting this algo- 
rithm to a local memory multiprocessor with a hypercube 
topology should be relatively easy. A high efficiency is 
predicted in this case because all data transfers are to 
nearest neighbors and their cost should be very small com- 
pared to the computation cost. 

ABSTRACT


A concurrent algorithm for the solution of the Navier-Stokes 
equations expressed in curvilinear coordinates has been developed 
for execution on a distributed memory parallel computer. This 
algorithm offers the ultimate promise of near-supercomputer per- 
formance on relatively low-cost parallel computers. The new 
algorithm is based on an existing serial pressure-correction-based 
algorithm, and partitions the problem onto the processors using 
areal decomposition. The algorithm is demonstrated on an Intel 
iPSC for a complicated two-dimensional laminar flow problem, 
for various grid sizes and numbers of processors. Speedup per iter- 
ation approaches 100% parallel efficiency as the grid size is 
increased. However, the convergence rate of the concurrent proce- 
dure tends to deteriorate somewhat relative to the original serial 
algorithm as the grid size and the number of processors are 
increased, limiting the maximum speedup that was achieved to a 
factor of 9.16 with 16 processors. The degradation in convergence 
rate is traced to a poorer solution of the pressure correction equa- 
tion that is obtained in the concurrent procedure, and several reme- 
dies are proposed. Overall, the results are very encouraging. 


1. INTRODUCTION 


A major obstacle to the increased use of computational fluid 
dynamics in engineering design continues to be the long run times 
and high cost of the computer simulations. As an example, a run 
of a 3-D finite volume combustor code developed at the GE 
Research and Development Center [1] requires one to two hours 
of cpu time on a Cray-XMP supercomputer, for a grid with 75,000 
grid points. Such a calculation would require hundreds of hours 
on a minicomputer or engineering workstation, making it imprac- 
tical for routine design purposes. More exotic codes such as those 
used for the direct simulation of turbulence require even greater 
computational resources. 


Parallel processing offers the promise of greatly reducing the 
execution times for CFD codes by using many processors to attack 
the problem simultaneously. Recent advances in VLSI technology 
have led to the development of a class of relatively low-cost con- 
current computers, which use a moderate number of inexpensive 
processors, that offer near-Cray performance at a fraction of the 
cost, making them appear attractive for more routine use. The com- 
putational power of the individual processors ranges from that 
of a 16-bit microcomputer, in machines such as the Intel iPSC, 
and Ametek System 14, to fast vector processors in machines such 
as the Alliant FX/8 and the Intel iPSC/VX. Other parallel super- 
computers with small numbers of supercomputer processors are 
available or under development (such as the ETA10 and Cray 3)
which offer higher ultimate performance, but their high cost and 
limited availability make them less attractive for routine use for 
the foreseeable future. 

If low-cost processors are to be used, many will be required
to provide the level of performance required for CFD codes. Con- 
current computers of this type can either be of a shared memory 
or distributed memory architecture. Memory conflicts can limit 
the number of processors that can be effectively used in a shared 
memory system. Distributed memory systems do not suffer from 
this limitation, and can be scaled up to hundreds of processors 
while retaining high parallel efficiencies. The disadvantage of dis- 
tributed memory architectures is that the problem to be solved must 
be explicitly partitioned by the programmer onto the various 
processors in such a way that load balancing is maintained and 
communication between processors is minimized. For some prob-
lems, it may not be easy or even possible to find a satisfactory 
means of doing this partitioning. Fortunately, for finite volume 
fluid mechanics algorithms, a simple and natural geometrical par- 
titioning of the problem meets the requirements very nicely. Con- 
sequently, distributed memory architectures appear very well suited 
for the solution of CFD problems, as has been noted by other 
researchers [2,3], and are the focus of attention here. 


It is interesting to note that the evolution of both architectures 
appears to be toward hybrid systems offering aspects of both local 
and global memory, which will tend to blur the distinction between 
the two types of machines. Examples are the shared memory 
Flex/32 [2], which allows the memory to be allocated either locally 
or globally, and distributed memory machines like the experimental 
J-machine under development at MIT [4], in which the commu- 
nication between the processors is so rapid that a virtual global 
memory addressing system can be implemented. In any case, since 
the field of parallel processing is developing so rapidly, with 
machines appearing (and disappearing) daily, the concurrent 
algorithm developed here was designed to be generally applicable 
to the general class of distributed memory computers, rather than 
to a particular machine. 


In this work, the development of a concurrent algorithm for 
the solution of the Navier-Stokes equations expressed in curvilinear 
coordinates is described. First, the original serial algorithm is 
reviewed, and then the development of the concurrent algorithm 
is detailed. A theoretical analysis of the potential speedup availa- 
ble from the concurrent algorithm is followed by a demonstra- 
tion of the algorithm on an Intel hypercube for a complicated 2-D 
laminar flow problem. 


2. REVIEW OF ORIGINAL ALGORITHM 


The numerical algorithm developed here for execution on a 
distributed memory parallel processor is a direct extension of the 
serial incompressible flow algorithm described in [5,6]. Only a brief 
outline of the original algorithm is given here; the reader is referred 
to the original references for the details. For simplicity, the dis- 
cussion is limited to incompressible laminar flows, but the 


algorithm is extendable to turbulent, compressible, and chemically 
reacting flows. 


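The governing conservation equations typically can be written
in Cartesian coordinates for the dependent variable $\phi$ in
the following form

$$\frac{\partial}{\partial x}(\rho u \phi) + \frac{\partial}{\partial y}(\rho v \phi) = \frac{\partial}{\partial x}\left(\Gamma \frac{\partial \phi}{\partial x}\right) + \frac{\partial}{\partial y}\left(\Gamma \frac{\partial \phi}{\partial y}\right) + R(x, y) \qquad (1)$$

where $\Gamma$ is the effective diffusion coefficient and R is the source
term. When new independent variables $\xi$ and $\eta$ are introduced,
Eq. (1) changes according to the general transformation
$\xi = \xi(x, y)$, $\eta = \eta(x, y)$. Equation (1) can be rewritten in
$(\xi, \eta)$ coordinates as follows:

$$\frac{1}{J}\frac{\partial}{\partial \xi}(\rho U \phi) + \frac{1}{J}\frac{\partial}{\partial \eta}(\rho V \phi) = \frac{1}{J}\frac{\partial}{\partial \xi}\left[\frac{\Gamma}{J}(q_1 \phi_\xi - q_2 \phi_\eta)\right] + \frac{1}{J}\frac{\partial}{\partial \eta}\left[\frac{\Gamma}{J}(-q_2 \phi_\xi + q_3 \phi_\eta)\right] + S(\xi, \eta) \qquad (2)$$

where U and V are the contravariant velocity components, $q_1$, $q_2$,
and $q_3$ are metric terms arising from the coordinate transformation,
J is the Jacobian of the transformation, and $S(\xi, \eta)$ is the
source term in the $\xi$-$\eta$ coordinates.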


A staggered grid system is adopted, following the standard prac- 
tice for incompressible finite volume algorithms. The scalar vari- 
ables (p, @) are located at the center of the control volumes and 
the Cartesian and contravariant velocity components are located 
on the faces of the control volumes. Discretization of Eq. (2), along
with suitable interpolation for the variables whose values are
unknown on the control volume faces, leads to the following
general form of the conservation equation for the variable $\phi$:

$$A_P \phi_P = \sum_{i = E,W,N,S} A_i \phi_i + (S_\phi)_P \qquad (3)$$

The subscripts P, E, W, N, and S refer to the grid point at the center
of the control volume and the four neighboring grid points,
respectively. The term $(S_\phi)_P$ includes the original source term in the
equation, plus the additional terms that cannot be approximated
by the values of $\phi$ at the five grid points. A successive-line-
underrelaxation method is used to solve the resulting finite
difference equations for each variable $\phi$.


The momentum and continuity equations, along with appropri- 
ate boundary conditions, make up the complete description of a 
laminar flow. The solution of these coupled equations makes up 
the kernel of any computational fluid mechanics algorithm. Lami- 
nar flows are commonly used to test the performance of numeri- 
cal algorithms since the effects of the pressure-velocity coupling, 
which usually controls the convergence of the algorithm, are most 
clearly evident for such flows [7]. 

The method used to solve the coupled system of momentum
and continuity equations is a pressure-correction method similar
to that described in [8]. Following the notation of Ref. [5], the
momentum equations can be written as

$$A_P^u u_P = \sum_{i = E,W,N,S} A_i^u u_i + D^u + (B^u p_\xi + C^u p_\eta) \qquad (4)$$

$$A_P^v v_P = \sum_{i = E,W,N,S} A_i^v v_i + D^v + (B^v p_\xi + C^v p_\eta) \qquad (5)$$


The Ds represent coefficients arising from the viscous cross-
derivative terms, and the Bs and Cs represent the projected areas
acted on by the pressure gradients in the $\xi$ and $\eta$ directions,
respectively. The momentum equations can be solved, for a given
pressure distribution p*, to yield a tentative velocity field u*, v*. Since
u*, v* do not necessarily satisfy the continuity equation, they and
the guessed pressure field p* must be updated. The corrected
velocities and pressure are obtained from:

$$p = p^* + p', \quad u = u^* + u', \quad v = v^* + v' \qquad (6)$$


Through manipulation of the momentum equations, and the
formulas defining the contravariant velocities U, V in terms of u, v
and the various metric derivatives, the velocity corrections U', V'
can be expressed in terms of p', through the relations:

$$U' = (B^u y_\eta - B^v x_\eta)\, p'_\xi$$

$$V' = (C^v x_\xi + C^u y_\xi)\, p'_\eta \qquad (7)$$


These expressions are then substituted into the discrete continuity
equation:

$$(\rho U)_e - (\rho U)_w + (\rho V)_n - (\rho V)_s = 0 \qquad (8)$$

leading to the final form of the pressure correction equation:

$$(\rho U^*)_e + (\alpha \rho)_e (p'_E - p'_P) - (\rho U^*)_w - (\alpha \rho)_w (p'_P - p'_W) + (\rho V^*)_n + (\beta \rho)_n (p'_N - p'_P) - (\rho V^*)_s - (\beta \rho)_s (p'_P - p'_S) = 0 \qquad (9)$$

where $\alpha$ and $\beta$ are the coefficients derived by combining Eq. (7)
and Eq. (8). This equation is solved, and the pressure and the
velocity components are updated, completing one global iteration. Due
to the nonlinearity of the problem, a number of global iterations
are required to obtain a converged solution.


3. CONCURRENT IMPLEMENTATION 


Effective implementation of the algorithm described in Sec- 
tion 2 on a distributed memory concurrent computer requires the 
satisfactory resolution of three major computational issues, 
namely: 1) load balancing, 2) minimization of communications 
costs, and 3) the development of an efficient concurrent algorithm. 
The development of an efficient concurrent algorithm is the criti- 
cal step in making effective use of any parallel architecture. In the 
attempt to keep all of the processors busy generating floating point 
numbers at impressive combined mflop rates, it is easy to lose sight 
of the fact that the true goal is to achieve the same solution to 
the physical problem in less time than with a single processor. 


The concurrent algorithm described in this section was 
implemented on a 16-node, memory-enhanced Intel hypercube. 
Each processor is rated at only 0.03 mflop, so ideal 
performance with a 16-node system is only about 0.5 mflops. The 
emphasis here is therefore not on the absolute performance 
of the code, but on the speedup obtained with multiple proces- 
sors relative to a single processor. The ultimate goal is to run the 
code on a machine with faster processors to achieve near- 
supercomputer performance. Although the basic concurrent 
algorithm is applicable to any distributed memory parallel proces- 
sor, details of the implementation motivated by the specific 
architecture of the iPSC are mentioned when appropriate. 


A widely used technique for partitioning the solution of a par- 
tial differential equation onto a number of processors is the areal 
decomposition method [9]. This method represents an extension 
of the classical Schwarz alternating method [10] to a parallel 
architecture. The solution domain is divided up into a number of 
overlapping subdomains, and each subdomain is assigned to a 
different processor. Overlapping is necessary so that each interior 
grid point is treated as an interior point in at least one subdo- 
main. In parallel, the coefficients of the equation are calculated 
in each subdomain, and an iterative solution is obtained in each 
region to some reasonable level of convergence. The boundary 
values are then exchanged with the neighboring subdomains, and 
the solution is iterated further. When some suitable global con- 
vergence criterion is satisfied, the solutions on each subdomain 
can be assembled into the complete solution for the entire domain. 
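
The procedure just described can be sketched on a one-dimensional model problem. The example below is only an illustration under stated assumptions: it solves u'' = 0 with u(0) = 0 and u(1) = 1, uses Gauss-Seidel sweeps as the local iterative solver, and loops over the strips sequentially where a parallel machine would assign one strip per processor. None of the names or parameter values come from the original code.

```python
def schwarz_laplace_1d(n=17, nsub=4, overlap=2, sweeps=5,
                       tol=1e-10, max_outer=20000):
    """Overlapping-strip iteration for u'' = 0 on n points, u[0]=0, u[-1]=1."""
    u = [0.0] * n
    u[-1] = 1.0
    interior = n - 2
    strips = []
    for s in range(nsub):
        own_lo = 1 + (s * interior) // nsub          # points this strip owns
        own_hi = 1 + ((s + 1) * interior) // nsub    # exclusive
        ext_lo = max(1, own_lo - overlap)            # strip widened by overlap
        ext_hi = min(n - 1, own_hi + overlap)        # exclusive
        strips.append((own_lo, own_hi, ext_lo, ext_hi))

    for outer in range(1, max_outer + 1):
        new = u[:]
        for own_lo, own_hi, ext_lo, ext_hi in strips:
            # Copy the strip plus one frozen boundary value on each side;
            # these values play the role of the last boundary exchange.
            local = u[ext_lo - 1:ext_hi + 1]
            for _ in range(sweeps):
                for k in range(1, len(local) - 1):
                    local[k] = 0.5 * (local[k - 1] + local[k + 1])
            offset = ext_lo - 1
            new[own_lo:own_hi] = local[own_lo - offset:own_hi - offset]
        change = max(abs(a - b) for a, b in zip(new, u))
        u = new                                      # "exchange" of new values
        if change < tol:                             # global convergence check
            return u, outer
    return u, max_outer
```

Calling `schwarz_laplace_1d()` returns the assembled solution and the number of outer iterations (exchanges) used. Because every strip reads only the values from the previous exchange, the inner loop over strips has no dependence between strips, which is the property that lets each strip run on its own processor.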


Since the areal decomposition method appears to be a natu- 
ral (and general purpose) way of partitioning problems involving 
the solution of pde’s by either finite volume or finite element 
methods for execution on a parallel computer, it was the method 
adopted here. The solution domain is divided up into a number 
of overlapping subdomains, one per processor, with the number 
of grid points in each subdomain approximately equal. Since the 
work required for each node point is roughly the same, this ensures 
that load balancing is excellent. 


In a curvilinear coordinate code, much of the storage burden 
is taken up by the storage of the grid point positions and the many 
metric derivative terms that arise from the coordinate transfor- 
mation. Since the grid is held fixed in the course of the calcula- 
tion, it is far more computationally efficient to compute the metric 
terms once and for all at the beginning and store the results, rather 
than recomputing the metric terms repeatedly throughout the cal- 
culation. A geometrical decomposition has the advantage that this 
metric information needs only be stored in the local memory of 
the processor that is assigned to that region of the domain, and 
does not need to be communicated between processors. Only the 
interfacial values of the solution variables, which are updated in 
the course of the solution, need to be passed between processors. 
The only storage penalty that arises from the areal decomposi- 
tion method is due to the need to store the metric information 
and solution variables in the overlapping regions twice. 


The solution domain can be divided into strips, boxes, or 
arbitrarily shaped regions with roughly the same number of points. 
The key factor for minimizing the communications costs is to max- 
imize the ratio of computation in each processor to the commu- 
nication between processors. The computational work in each 
subdomain is dependent on the number of control volumes (vol- 
ume) of the subdomain, while the amount of communication 
between subdomains depends on the number of boundary cells 
(surface area) of the subdomain. For a large enough problem and 
a moderate number of processors, decomposition by strips leads 
to a smaller surface-to-volume ratio of the subdomains than does 
a boxwise decomposition. With the structured grid used in a finite 
volume formulation, there is no need to resort to arbitrarily shaped 
regions, although they are useful in a finite element context. The 
disadvantage of stripwise decomposition is that the number of 
processors that can be effectively used is limited to something on 
the order of n, where n is the largest number of control volumes 
in any given direction. For both two- and three-dimensional prob- 
lems, n will seldom exceed 100; consequently, the number of proces- 
sors that can be effectively used is similarly limited. In 
three-dimensional problems, boxwise decomposition may be a bet- 
ter choice since it will allow a larger number of processors, and 
consequently more computational power, to be applied to the prob- 
lem. 


The best choice for the partitioning of the problem depends 
on the number of processors available and the ratio of computa- 
tional speed to communications speed for the machine under con- 
sideration. The Intel iPSC hypercube consists of an Intel 
80286-based host machine and a computational cube, consisting 
of up to 128 Intel 80286-based processors connected in a hyper- 
cube arrangement. All communication is slow relative to the speed 
of computation. With this number of processors and communi- 
cation speed, stripwise decomposition is a good choice for this 
machine. 


Due to the nature of the staggered grid used, two rows of grid 
points must be overlapped to ensure that all interior u velocities 
appear as interior points in at least one subdomain. This also causes 
one extra row of v velocities and pressures to be solved redundantly 
in each subdomain, which increases the computational effort for 
the concurrent algorithm somewhat over that of the original serial 
algorithm. 
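
A stripwise partition with redundant lines at the interfaces can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes that each interface between adjacent strips adds one redundantly solved grid line, so that NI + p - 1 lines are solved in total, and it ignores the details of the staggered grid. The function name is hypothetical.

```python
def strip_ranges(ni, p):
    """Split grid lines 0..ni-1 into p strips of nearly equal size.

    Adjacent strips share one solved line, so the strips together
    cover ni + p - 1 lines. Returns inclusive (first, last) indices.
    """
    total = ni + p - 1
    base, extra = divmod(total, p)
    ranges = []
    start = 0
    for s in range(p):
        size = base + (1 if s < extra else 0)
        ranges.append((start, start + size - 1))
        start += size - 1          # next strip re-solves the shared line
    return ranges
```

For example, `strip_ranges(64, 16)` yields fifteen strips of five lines and one of four, so the work per processor differs by at most one line.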


There are a number of characteristics of the iPSC that influence 
the detailed coding of the algorithm required to achieve good con- 
current performance. There is a significant overhead associated 
with routing messages from one processor to the next, putting a 
premium on nearest neighbor communication. This overhead is 
overcome by mapping the hypercube to the required linear array 
through the use of binary reflected Gray codes [11]. Gray codes 
are sequences of n-bit binary numbers with the properties that any 
two successive numbers differ only in one bit, and all binary num- 
bers with n bits are included. Each processor is then a nearest neigh- 
bor to the two processors that are handling the two adjacent 
subdomains. As the code is implemented here, the host machine 
is used only to read in the grid file and the input variables, moni- 
tor convergence, and write out the final solution. Note that on 
the existing iPSC, all I/O must be done on the host machine. An 
efficient concurrent broadcast routine, contained in Intel’s proto- 
code library, is used to pass all of the input information from the 
host to each processor in the form of identical messages. In par- 
allel, each processor then extracts the portion of the grid file that 
it needs, and the input variables from these messages. This proce- 
dure is much more efficient than having the host machine sequen- 
tially create and send individualized messages to each node. The 
host machine is slow, and the cost of host-to-processor messages 
is high. The concurrent broadcast routine requires only one host- 
to-processor message, with all the other messages passed via a span- 
ning tree from processor to processor, with some message pass- 
ing occurring in parallel. Convergence is monitored by sending 
the mass residual to the host machine and comparing it to the 
prescribed convergence criteria. A partial mass residual is com- 
puted by each processor and summed together using a concur- 
rent concentration routine that forms the total mass residual and 
sends it to the host via a spanning tree, in a reverse manner to 
what was done in the concurrent broadcast. This convergence check 
is performed every fifth iteration to reduce the overhead of the 
node-to-host communication. 
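
The Gray code mapping mentioned above can be illustrated with a short sketch. The function names are chosen for this example only; the formula `i ^ (i >> 1)` is the standard expression for the binary reflected Gray code.

```python
def gray(i):
    """Return the i-th binary reflected Gray code."""
    return i ^ (i >> 1)


def strip_to_node(dim):
    """Map strip s of the linear array to a node id of a dim-cube."""
    return [gray(s) for s in range(1 << dim)]


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

For a 4-dimensional cube, `strip_to_node(4)` begins 0, 1, 3, 2, and every pair of consecutive entries differs in exactly one bit, so adjacent strips always reside on directly connected nodes.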


The existing serial solution algorithm solves the governing equa- 
tions in the following order: x-momentum, y-momentum, and 
finally pressure correction. The same structure is retained in the 
concurrent implementation. In parallel, the coefficients for the 
x-momentum equation are computed for each subdomain. In par- 
allel, a few sweeps of an iterative, block-corrected, line-by-line solu- 
tion procedure [12] are performed in each subdomain. During this 
process, the boundary values for u in each subdomain are held 
fixed. Next, the boundary values for u are exchanged between sub- 
domains, and the line-by-line solution procedure is repeated. The 
number of repetitions of the line-by-line solution and the exchange 
of the boundary values is prescribed by the user. 


Upon completion of the x-momentum equation, the procedure 
is repeated for the y-momentum equation, and finally for the pres- 
sure correction equation. This equation by equation procedure was 
selected for several reasons. First, it reduces to the original 
algorithm for a single processor. Thus, comparison of the speedup 
obtained with p processors over p subdomains is made relative 
to the original algorithm on a single processor over the entire solu- 
tion domain, which reflects a comparison of the parallel algorithm 
with the best serial algorithm, in the spirit of S’, as defined by 
Ortega and Voigt [13]. Second, if enough repetitions of the line- 
by-line solver and the exchange of boundary points are performed 
so that a converged solution is obtained for each equation before 
proceeding to the next equation, then the overall rate of conver- 
gence of the algorithm will be the same as for the serial algorithm 
if each equation is also solved to convergence at each stage. 
Although it is not the usual practice to solve each equation fully 
to convergence at each step, since the coefficients are only tenta- 
tive, this similarity suggests that the new concurrent algorithm will 
show comparable convergence behavior to the serial algorithm, 
which has proved to converge well for a large variety of problems. 


4. THEORETICAL SPEEDUP ANALYSIS 


Theoretical estimates can be made of the speedup that can be 
attained with p processors relative to a single processor for the 
current algorithm. The speedups achieved will be somewhat less 
than linear due to the following factors: 1) the additional com- 
putational work due to the overlapping regions, 2) the cost of the 
message passing, and 3) the reduction in convergence rate due to 
the change in the solution procedure from line-by-line over a sin- 
gle domain to the concurrent Schwarz alternating procedure. 


Expressions for the degradation factors resulting from over- 
lapping and communication costs can be formulated by counting 
the number of arithmetic operations and messages passed as a 
function of the problem size and the number of processors. The 
overlapping of the subdomains causes more lines of control 
volumes to be solved than in the serial algorithm, increasing the 
computational work, and the cost of message passing represents 
an overhead not found in the serial algorithm. 


The computational effort in each subdomain is proportional 
to the number of control volumes in the subdomain, and can be 
expressed as: 


$$(t_{cpu})_p = k_{cpu} \left(\frac{NI + p - 1}{p}\right) NJ \tag{10}$$


where $k_{cpu}$ represents the cpu time per control volume per itera- 
tion of the algorithm. Each processor exchanges essentially the 
same mix of messages with its two adjacent processors at each 
iteration. On the Intel iPSC, the overhead for message passing 
is so high that messages less than 1024 bytes all take essentially 
the same amount of time to transfer [9]. Since most of the mes- 
sage traffic consists of short messages, the time per iteration that 
a processor spends passing messages can be expressed as 


$$(t_m)_p = k_m (2 n_m) \tag{11}$$


Here $k_m$ represents the average time required to send and receive 
a short message, and $n_m$ represents the number of messages 
exchanged with an adjacent processor per iteration. In this sim- 
ple analysis, the time processors spend idle waiting for messages 
that have not yet been sent, due to a lack of synchronization 
between processors, is included in this term. 


The ratio of cpu time per iteration, including both computa- 
tion and message passing, for the concurrent algorithm with p 
processors to the cpu time for the original serial algorithm becomes 


$$\frac{t_p}{t_s} = \frac{k_{cpu} \left(\frac{NI + p - 1}{p}\right) NJ + 2 k_m n_m}{k_{cpu}\, NI\, NJ} \tag{12}$$


The ratio of the total computational time for the concurrent 
algorithm to that for the serial algorithm is given by 


$$\frac{T_p}{T_s} = \frac{i_p t_p}{i_s t_s} \tag{13}$$


where $i_p$ and $i_s$ are the numbers of iterations required to obtain a 
solution to the same level of convergence for the parallel and serial 
algorithms, respectively. After rearrangement, the following expres- 
sion for the speedup obtained with the concurrent algorithm with 
p processors over the original serial algorithm on one processor 
is obtained: 


$$S = \frac{T_s}{T_p} = \left(\frac{i_s}{i_p}\right) \frac{p}{1 + \frac{p - 1}{NI} + \frac{2 k_m n_m\, p}{k_{cpu}\, NI\, NJ}} \tag{14}$$


In a general sense, the speedup S can be expressed as: 


$$S = e\,p = \left(\frac{i_s}{i_p}\right) \frac{p}{1 + O + C} \tag{15}$$


Here, the term e equals the parallel efficiency, which is less than 
unity due to the three factors listed above, namely the overlap- 
ping penalty O, the communication cost C, and the reduction in 
convergence rate $(i_s/i_p)$. 

The important thing to note from the above expressions is that 
the penalties due to overlapping and communication become 
smaller and smaller as the problem size gets bigger, with every- 
thing else fixed. Hence, for a large enough problem, the only degra- 
dation from a linear performance improvement will result from 
any reduction in convergence rate that may result. Unfortunately, 
this degradation cannot be predicted analytically and must be 
determined from computational experiments. Note that, due to 
the particular formulation of the concurrent algorithm adopted 
here, the same convergence rate as the original serial algorithm 
can always be achieved by increasing the number of line-by-line 
sweeps. However, since this increases the cpu work per iteration, 
the total cpu time may or may not decrease. The optimum num- 
ber of sweeps for each variable for the concurrent algorithm must 
also be determined via numerical experiments, and is not neces- 
sarily the same as for the original serial algorithm. 
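
The speedup model of this section can be evaluated numerically with a few lines of code. The sketch below assumes the overlap and communication penalties take the forms (p - 1)/NI and 2 k_m n_m p / (k_cpu NI NJ), which follow from Eq. (12); the function name and the argument `iter_ratio` (standing for the ratio of serial to parallel iteration counts) are introduced here for illustration.

```python
def speedup(p, ni, nj, k_cpu, k_m, n_m, iter_ratio=1.0):
    """Model speedup S = iter_ratio * p / (1 + O + C).

    O is the overlap penalty and C the communication penalty, both
    derived from the per-iteration time ratio of Eq. (12).
    """
    overlap = (p - 1) / ni
    comm = 2.0 * k_m * n_m * p / (k_cpu * ni * nj)
    return iter_ratio * p / (1.0 + overlap + comm)
```

With communication neglected (k_m = 0), the model gives 16 x 64 / 79, or about 12.96, for NI = 64 and p = 16. Any values chosen for k_cpu, k_m, and n_m are machine-dependent inputs that would have to be measured; none are given in the text.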


5. DEMONSTRATION RUNS 


A series of demonstration runs of the new concurrent algorithm 
was made on a 16-node, memory-enhanced Intel iPSC hypercube 
at the University of Lowell, MA. The concurrent version of the 
algorithm was first developed and tested using Intel’s hypercube 
simulator, running under Unix 4.2 on a SUN 3/160 workstation. 
Although it certainly would be useful to confirm the performance 
of the concurrent algorithm for a much wider range of flow con- 
figurations, the similarity of the concurrent algorithm to the serial 
algorithm, which has been widely tested, gives confidence that the 
limited results presented here will be representative of the perfor- 
mance of the algorithm in general. 


The test case selected involves steady laminar flow in the 
axisymmetric afterburner configuration shown in Figure 1. As 
described earlier, it is useful to study laminar flows since the solu- 
tion of the coupled momentum and continuity equations forms 
the kernel of any computational fluid dynamics algorithm and 
can be most clearly studied in laminar flow. The solutions presented 
here are a first step towards a realistic afterburner simulation, which 
will include equations for turbulence and combustion that can be 
solved by the same basic procedure. 


Due to limitations on cpu time, very large grid sizes could not 
be attempted. Two body-fitted grids, one with 32 x 20 nodes, and 
the second with 64 x 20 nodes, were used in the calculations. Both 
grids are shown in Figure 2. For reference, the converged solution 
for the fine grid is shown in Figure 3, by means of plots of the 
calculated velocity vectors and the streamlines. Although the flow 
is laminar, the flowfield is not simple and contains wake regions 
behind the flameholders and a recirculation zone near the trail- 
ing edge of the centerbody. 


Although the performance of the original serial algorithm is 
affected by the choice of such parameters as underrelaxation fac- 
tors for the velocities and pressure, and the number of line sweeps 
for each variable, previous studies [14] have shown the sensitivity 
is not that great, provided reasonable values are used. In this work, 
no attempt has been made to optimize these factors for each run; 
rather, reasonable values of these parameters were held fixed for 
all runs. The underrelaxation factors for the x- and y-momentum 
equations were taken equal to 0.3, and that for pressure was taken 
equal to 0.5. Two sweeps of the line-by-line procedure were used 
for both x- and y-momentum, with three sweeps taken for pres- 
sure correction. 


The number of iterations required for convergence is depen- 
dent on the choice of convergence criterion. Here, the solutions 
were taken to be converged when the normalized mass residual 
was less than $10^{-3}$. It was confirmed that the converged solutions 
obtained with the concurrent algorithm were independent of the 
number of processors used, and identical to those obtained with 
the original serial algorithm. 


5.1 Results 


As discussed earlier in Section 4, as the grid size is increased, 
the speedup per iteration of the algorithm should approach a lin- 
ear speedup with the number of processors, since the penalties 
associated with overlapping and communication become less sig- 
nificant. Figure 4 demonstrates that the speedups obtained on the 
test problem for the two grid sizes exhibit this trend. With 16 
processors, the maximum speedup per iteration increased from 
11.12 for the coarse grid to 12.84 for the fine grid, representing 
parallel efficiencies of about 70% and 80%, respectively. The major 
contribution to the efficiency being less than 100% comes from 
the overlapping of the subdomains, rather than from the commu- 
nications costs, which appear minimal. 


Figure 1. Axisymmetric afterburner configuration. 

Figure 2. Computational grids for test problem, a) coarse 32 x 
20 grid, b) fine 64 x 20 grid. 

Figure 3. Converged solution for the fine grid, a) velocity vec- 
tors, b) streamlines. 

Figure 4. Speedup per iteration for concurrent algorithm on 
Intel iPSC. 


However, the speedup in terms of total computational time, 
shown in Figure 5, does not show the same improvement as the 
number of grid points is increased, due to a degradation in the 
convergence rate of the algorithm as the number of grid points 
and the number of processors are increased. The convergence paths, 
as a function of the number of processors, are shown for the coarse 
grid and the fine grid in Figures 6 and 7, respectively. Note that 
for both grids, the original serial algorithm (1 node) shows a 
smooth monotonic reduction in the mass residual, reaching an 
asymptotic rate of convergence that is substantially less than the 
initial convergence rate. The concurrent algorithm shows a nois- 
ier convergence path with a slower initial rate of convergence, but 
with a similar asymptotic rate of convergence. With 16 processors, 
the convergence rate was 24% less for the coarse grid, and 40% 
less for the fine grid, than the corresponding rates for the serial 
algorithm. 


Figure 5. Speedup of total time for concurrent algorithm on Intel 
iPSC. 

Figure 6. Convergence behavior of concurrent algorithm on coarse 
32 x 20 grid. 

Figure 7. Convergence behavior of concurrent algorithm on fine 
64 x 20 grid. 


Earlier studies of the serial algorithm have shown that the ellip- 
tic nature of the pressure correction equation makes it more dif- 
ficult to converge than the momentum equations, which tend to 
be convection-dominated [15]. Schwarz’s alternating method is also 
known to be slowly convergent for highly elliptic problems when 
many subdomains are used [16]. A series of runs was made vary- 
ing the number of line sweeps for the pressure correction equa- 
tion, for the fine grid test case using the concurrent algorithm with 
16 processors. Figure 8 shows that the convergence rate of the con- 
current algorithm approaches that of the serial algorithm as the 
number of pressure correction sweeps is increased, as expected. 
This indicates that the reason for the slower convergence of the 
concurrent algorithm in the original case was a poorer solution 
of the pressure correction equation with the same number of sweeps 
used in the serial procedure. It is interesting to note from Figure 
9 that the total computational time required by the concurrent 
algorithm is relatively insensitive to the number of line sweeps used 
for pressure correction, provided that at least three sweeps are per- 
formed. Any improvement in convergence rate achieved by per- 
forming more than five sweeps is more than offset by the increased 
cost per iteration that results. 


6. CONCLUDING REMARKS 


A concurrent algorithm for the solution of the Navier-Stokes 
equations expressed in curvilinear coordinates has been developed 
and demonstrated on a distributed memory parallel processor. The 
speedup per iteration approaches 100% parallel efficiency as the 
grid size is increased. However, a reduction in the convergence rate 
of the algorithm as the grid size and the number of processors 
are increased, caused by a poorer solution of the pressure correc- 
tion equation, limits the speedup in terms of the total computa- 
tional time relative to the original serial algorithm to less than 
linear. Despite this, the algorithm looks very promising, with the 
potential for giving near-supercomputer performance on a dis- 
tributed memory machine with faster processors. 


Figure 8. Effect of number of line sweeps for pressure correction 
equation on number of iterations required for conver- 
gence (64 x 20 grid, 16 processors). 

Figure 9. Effect of number of line sweeps for pressure correction 
equation on total time required by concurrent algorithm 
(64 x 20 grid, 16 processors). 


Further work is needed in the following areas: 


1) Block correction or multigrid techniques should be investigated 
to improve the parallel solution of the pressure correction equa- 
tion, in an attempt to minimize the degradation in convergence 
rate observed here for large grids and large numbers of proces- 
sors. Multigrid methods have been successfully implemented 
on hypercubes [17], and have been shown to improve the con- 
vergence of the pressure correction equation for serial 
algorithms [18]. 


2) A convergence criterion should be used for the solution of each 
individual equation, rather than a fixed number of line sweeps, 
to ensure the best tradeoff between cost per iteration and overall 
convergence rate. 


3) More test problems should be studied to verify the performance 
of the concurrent algorithm under a wider range of conditions. 
Problems involving turbulence and chemical reactions should 
be included in the study. 


4) The concurrent algorithm should be run on a distributed mem- 
ory machine with vector processors, such as Intel’s iPSC/VX, 
to measure the speedup relative to a single processor and the 
total computational time. A successful demonstration will prove 
the practical utility of such machines for running CFD codes. 


Abstract -- Electron transport simulators are 
important tools for studying electrical properties of 
semiconducting materials and devices. As demands for 
modeling more complex devices and new materials have 
emerged, so have demands for more processing power. 
This paper documents a project to convert an electron 
transport simulator (MOCASIN 2.0) to a parallel pro- 
cessing environment. In addition to describing the 
conversion, the paper presents PPL, a parallel program- 
ming version of C running on a Sequent multiprocessor 
system. In timing tests, models that simulated the 
movement of 2,000 particles for 100 time steps were 
executed on ten processors, with a parallel efficiency of 
over 97%. In this revision to MCC Technical Report 
ACA-ST/CAD-328-87, an additional table has been 
added to explain the apparent discrepancy in the timing 
results found in Table 1. 


Introduction 


Device simulators play an important role in the 
characterization and advancement of semiconductor 
technology. They are used to characterize and predict 
the electrical behavior of various devices, such as 
transistors and diodes, fabricated from different materi- 
als including germanium, silicon and gallium arsenide. 
Currently, the most common device simulators are 
based on the drift-diffusion approximation [7, 8, 12]. 
Drift-diffusion based simulators enjoy the properties of 
familiarity (due to widespread use), rapid execution 
(due to the simplifying approximations made about the 
physical properties of the devices), and reliable predic- 
tions for a large class of problems. Unfortunately, the 
drift-diffusion approximation becomes less accurate as 
device sizes shrink, as the electric field inside of the 
devices becomes stronger, and as the electric field varies 
rapidly in space or time. These trends are becoming 
more evident as semiconductor technology advances. 


Device simulators that model transport of a carrier 
(an electron or a hole) at the scattering level and that 
use Monte Carlo simulation remove several of the 
assumptions made in the drift-diffusion simulators 
about device behavior [5, 6]. This new form of simula- 
tor has several advantages over the drift-diffusion based 
simulators, including: (1) they can be easily modified to 
include new scattering processes; (2) they can model 
intrinsic noise phenomena; and (3) they represent the 
physical behavior more accurately. However, a disad- 
vantage is that they can require enormous amounts of 
CPU time; runs requiring hours or days of computing 
time on a Sun 3/260 workstation are common. 


At MCC, a program to model electron transport in 
planar MESFET’s (MEtal Semiconductor Field Effect 
Transistors) was developed as part of the Computer 
Aided Design (CAD) project. This program, named 
MOCASIN 2.0 [2], was implemented for the Apollo 
660 workstation and the Sun 3 workstation [BuBI86] 
over a two year period. In the middle of 1987, a colla- 
boration between the CAD project and the Parallel Pro- 
cessing project at MCC led to the development of a 
parallel version of MOCASIN 2.0 which runs on the 
Sequent Balance multiprocessor system. This paper 
describes the implementation of parallel MOCASIN 
2.0, with emphasis on the changes required to convert 
the serial version to the equivalent parallel version. A 
key factor in facilitating this conversion was PPL [9, 10, 
11], a superset of C with enhancements to facilitate 
parallel processing. 


The Simulator - MOCASIN 2.0 


MOCASIN 2.0 is a two-dimensional, scattering- 
level transport simulator for planar MESFET’s fabri- 
cated from III-V compounds. The program places 
charged particles (representing the carriers) in a rec- 
tangular simulation region (representing the MESFET) 
and solves for the electric field due to charged particles, 
ionized dopants, and boundary conditions. The parti- 
cles are moved according to the local electric field and 
scattered according to local material properties. After 
each particle has been moved for a very short time 
interval (5x10* seconds), the electric field is updated. 
These two steps - field determination and particle tran- 
sport - are repeated until the device behavior has been 
simulated for the specified time interval. 

The carrier transport process also consists of two 
stages. In the first stage, each particle is moved under 
the influence of the local electric field for a randomly 
selected length of time. During the second stage, a 
scattering process is randomly selected from a set of 
possible processes, and the particle’s energy-momentum 
state is randomly altered according to the selected 
scattering process. 
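
The two random selections in the transport step can be sketched as follows. This is an illustration under assumptions that the text does not state: the flight duration is drawn from an exponential distribution with a given total scattering rate, and a process is chosen with probability proportional to its rate. The function names and the rate values in the usage note are hypothetical and do not come from MOCASIN 2.0.

```python
import math
import random


def free_flight_time(total_rate, rng):
    """Draw a flight duration (assumed exponentially distributed)."""
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return -math.log(1.0 - rng.random()) / total_rate


def select_process(rates, rng):
    """Pick index k with probability rates[k] / sum(rates)."""
    r = rng.random() * sum(rates)
    acc = 0.0
    for k, rate in enumerate(rates):
        acc += rate
        if r < acc:
            return k
    return len(rates) - 1      # guard against floating-point round-off
```

With `rng = random.Random(0)` and rates `[1.0, 3.0]`, repeated calls to `select_process` return index 1 about three times in four. In a full simulator the selected process would then determine how the particle's energy-momentum state is altered.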


This procedure is very general and uses the more 
accurate approximations mentioned above. In particu- 
lar, drift-diffusion based simulators assume that the car- 
riers instantaneously reach a mean velocity determined 
by the local electric field. This assumption prevents 
drift-diffusion based simulators from accurately predict- 
ing the electrical behavior of submicron sized 
MESFET’s in Gallium Arsenide. Scattering-level based 
transport simulations, such as MOCASIN 2.0, model 
the experiment more accurately, but they require much 
more CPU time. 


To address the problem of long running times, a 
project was established to implement a parallel version 
of the simulator. The first step of this project was to 
select a portion of the simulator code to execute in 
parallel. The next steps were to define a process struc- 
ture and redefine the data structures for parallel execu- 
tion. After implementation of the parallel version, tests 
were conducted to verify correct operation and to 
measure the impact of the changes. 


The portion of the simulator chosen for conversion to 
parallel execution was the particle transport section, 
for three reasons: 


• it was easily isolated, 

• it consumed a major percentage of the CPU 
time in the serial version, and 

• the movement of each particle is independent of 
the movement of all of the other particles. 

As will be seen later on, this selection was fortuitous. 
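
The third reason, independence between particles, is what makes the transport step straightforward to distribute. The sketch below illustrates the idea with Python's standard thread pool standing in for the parallel workers; it is not PPL and not the MOCASIN code. The one-dimensional particle state, the uniform field, and the placeholder scattering rule are all assumptions made for the example. It shows the decomposition only and makes no claim about performance.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def move_particle(job):
    """Advance one particle using only its own state and its own seed."""
    x, v, field, dt, seed = job
    rng = random.Random(seed)        # private generator: no shared state
    v = v + field * dt               # drift in a hypothetical uniform field
    if rng.random() < 0.1:           # placeholder scattering event
        v = -v
    return (x + v * dt, v)


def transport_step(particles, field, dt, workers, step=0):
    """Move all particles for one time step using `workers` threads."""
    jobs = [(x, v, field, dt, step * len(particles) + i)
            for i, (x, v) in enumerate(particles)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(move_particle, jobs))
```

Because each particle carries its own seed, `transport_step` returns the same result with one worker as with several, which gives a simple check that the parallel version reproduces the serial one.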


Once this selection was made, MOCASIN 2.0 was 
moved to the parallel environment used for this project. 
This environment was the Sequent Balance 8000 mul- 
tiprocessor system [13], together with PPL, a parallel 
programming language based on C. PPL had been 
developed at MCC for the Balance system and, based 
on prior experiences [4], seemed to be a good choice for 
the conversion project. The remainder of this paper 
gives a brief overview of PPL and then details the major 
changes required to arrive at a parallel version of the 
simulator. 


Overview of PPL 


As stated above, PPL is a superset of C. The 
extensions all deal with creating and controlling multi- 
ple (parallel) processes within a single C program. The 
primary construct is the PPL process. PPL processes 
correspond to tasks or microtasks in some other parallel 
environments. The underlying runtime support pack- 
age uses UNIX† processes (created using the fork system 
call) to serve as ‘‘workers’’. In the remainder of this 
paper, the term ‘‘process’’ will refer to a PPL process, 
and the term ‘‘worker’’ will refer to a UNIX process 
which is associated with a processor. PPL processes 
which are eligible to execute are placed on a single 
‘‘ready-to-run’’ queue. As processes are placed on this 
queue, the idle workers are stimulated to examine this 
queue; one of the idle workers will remove the next eli- 
gible process and start its execution. PPL is designed 
to encourage the use of processes, and it completely 
hides the existence of the workers. The only control a 
programmer has over the workers is to specify the 
number of workers dedicated to the program. Each 
worker uses one physical processor, so the number of 
workers corresponds to the maximum level of parallel- 
ism which could be exploited by the program. A PPL 
program can have many simultaneously active processes 
with only a few workers. In fact, executing a PPL program
with only one worker assigned (and consequently
with all processes executing sequentially on one processor)
is a useful debugging technique.

† UNIX is a trademark of Bell Laboratories.


Processes are created dynamically. There are no 
implied relationships between processes. Processes can 
terminate any time after they are created. In addition, 
they can suspend execution, to await the occurrence of 
an event caused by another active process. Processes 
can initiate additional instances of themselves or other 
processes. The input arguments and variables local to a 
process are automatically preserved as the process 
progresses through alternating periods of being active 
and then being suspended. Processes can call functions 
and procedures, just as normal C procedures do. The 


PPL programmer is able to concentrate on implement- 
ing the parallel program, while the details of process 
management and process scheduling are handled by 
procedures in the PPL runtime library. 


In an environment with simultaneously active 
processes, there must be mechanisms for synchronizing 
the activities of these processes. Often, these synchron- 
ization mechanisms are used to guarantee that only one 
process at a time is active in a critical section of the code.
A classic example would be updating a variable which is 
located in shared memory (see below). PPL supports 
two kinds of synchronization operations: those based on 
suspending and resuming processes and those which use 
a ‘‘busy wait’’. The first kind has the advantage that
while a process is in the suspended state in some queue 
of suspended processes, the worker it was using can be 
assigned to another eligible process. However, there is 
some processing overhead required to place a process in 
this queue and to start another process. The second 
kind has very low overhead, but it does ‘‘waste’’ a
worker while the process is waiting for something to 
happen. PPL programmers are encouraged to use the 
suspend-resume form of synchronization if there is a 
high probability of being forced to suspend or if the 
process will be in the critical section for a long period of 
time. The busy-wait form is appropriate if there is a 
low probability of being forced to wait or if the time in 
the critical section will be short. 


While PPL has several different kinds of synchroni- 
zation mechanisms, the only ones used in the parallel 
version of MOCASIN 2.0 are locks, a mailbox, and a
counter. On the Sequent, locks are busy-wait mechan- 
isms which are implemented using the test-and-set-bit 
instruction and an array of lock bits. PPL provides 
statements to declare and initialize locks and to lock 
and unlock these locks. Mailboxes and counters are 
both suspend-resume synchronization mechanisms. A 
mailbox is used to coordinate the exchange of messages 
between processes. A process can send a message 
(either a number or a string of bytes) to a mailbox.
Another process can do a receive operation on that
mailbox. If there is a pending message, the receiving 
process will consume the message and continue; if there 
are no pending messages, the receiving process will be 
suspended, until a message arrives at the mailbox. A 
counter is often used by a parent process to wait until 


all of the child processes have completed their process- 
ing. A counter used in this fashion is initialized to an 
integer value (the number of children to be initiated). 
After the children are initiated, the parent suspends by 
performing a c_wait operation on the counter. As each 
child process completes, it does a c_set operation on the 
counter. As the last child does this, the parent process 
resumes and continues execution. 
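
The counter protocol just described can be sketched with ordinary Python threads. This is an illustration only: the class `Counter` and its methods `child_done` and `wait` are names invented for the sketch, not PPL statements, and Python threads stand in for PPL processes.

```python
import threading


class Counter:
    """Hypothetical stand-in for a PPL counter (not PPL's real API)."""

    def __init__(self, children):
        self._remaining = children          # number of children still running
        self._cond = threading.Condition()

    def child_done(self):
        """Called by each child as it completes."""
        with self._cond:
            self._remaining -= 1
            if self._remaining == 0:
                self._cond.notify_all()     # the last child wakes the parent

    def wait(self):
        """Suspend the caller until every child has reported completion."""
        with self._cond:
            while self._remaining > 0:
                self._cond.wait()


def run_children(n):
    """Start n children, wait on the counter, and return their results."""
    results = [None] * n
    counter = Counter(n)

    def child(i):
        results[i] = i * i                  # placeholder for real work
        counter.child_done()

    for i in range(n):
        threading.Thread(target=child, args=(i,)).start()
    counter.wait()                          # parent resumes after the last child
    return results
```

The parent blocks inside `wait` rather than spinning, which mirrors the suspend-resume style the text attributes to counters.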


In the parallel version of MOCASIN 2.0, several 
locks are used to guarantee that only one process at a 
time updates a variable in shared memory. A mailbox 
is used to implement a queue of particles which need to 
be moved during a time step, and a counter keeps up 
with the number of moved particles, so that the parent 
can resume processing after all of the particles have 
been moved in a single time step. 


The Sequent system provides a (logical) segment of 
memory which can be shared by all of the processes 
associated with a program. The C compiler provided 
by Sequent has been modified, to accept shared as part 
of the declaration of a variable. Variables declared as 
shared are located in the shared segment and can be 
accessed by PPL processes. Thus, there are three kinds 
of storage which can be used by PPL programs: storage 
which is globally defined and located in the shared seg- 
ment, storage which is globally defined, but located in 
the private segment of each worker, and storage which 
is automatically allocated on the runtime stack as each 
process or procedure is entered. Placing data in the 
correct class of storage is a crucial consideration in the 
design and implementation of a parallel program. 


The Parallel Version of MOCASIN 2.0 


As stated earlier, it was decided to parallelize that 
portion of MOCASIN 2.0 which moved the particles 
within a time step, and to leave the portions dealing 
with determining the electric field and advancing to the 
next time step in serial mode. It was hoped that by res- 
tricting the portion to be modified, the anticipated per- 
formance enhancements could be quickly obtained. 


MOCASIN 2.0 can be viewed as a main program 
that contains a time-step for-loop. (See Figure 1). This 
outer loop steps time from zero to the final simulation 
time. The electric field is determined at each time step 
as a function of the charge locations and boundary con- 
ditions. Then each particle is moved in the local 


electric field. In the serial version, this was accom- 
plished with a particle for-loop that steps through an 
array containing each of the simulated particles. This 
particle loop was the only section of MOCASIN 2.0 to 
be parallelized. The particle loop was parallelized by 
sequentially assigning the movement of each particle in 
turn to the next free processor. The lack of interaction 
between particles during the movement phase vastly 
simplified the serial-to-parallel conversion. 

MAIN:
    FOR EACH TIME STEP {
        DETERMINE ELECTRIC FIELD;
        MOVE_PARTICLE PROCEDURE {
            FOR EACH PARTICLE
                PERFORM ONE MOVE STEP;
        }
    }


Figure 1 
Structure of MOCASIN 2.0. 


Two kinds of changes were made to parallelize 
MOCASIN 2.0. One was to convert from serial execu- 
tion to parallel execution, to take advantage of the 
architecture of the Sequent. These modifications were 
accomplished using the constructs provided by PPL [9]. 
The parallel version was modified still further, to insure 
that repeated simulations of the same model would 
yield identical results. The possibility for different (but 
correct) results comes from the use of a single stream of 
random numbers in the serial version. In the parallel 
version, it is likely that particles would be moved in 
different ways on repeated executions of the program, 
because different sequences of random numbers would 
control each particle on successive runs. An alteration 
of the parallel version (see below) removed this kind of 
variability and guaranteed repeatable behavior. While 
not essential to correct operation of the model, these 
changes greatly simplified verification of the parallel 
version. Both kinds of changes could be implemented 
and tested on a single processor machine, in the PPL 
environment. 


As indicated in Figure 2, the user-requested 
number of mover processes is created prior to entering 


the time step loop. Then, the time-step loop is entered 
and, after the electric field is determined, the mailbox is 
filled by sending it a sequence of messages, each con- 
taining the index of a particle that must be moved. 
Control is turned over to the mover processes until they 
signal completion. This parallel phase replaces the 
serial for-loop control structure. When all of the parti- 
cles have been moved, the main procedure regains con- 
trol and the next time step is started. 


Another change required for parallelization was to 
insure that all variables that were being altered by the 
particle moving processes were located in shared 
memory. Some of the data structures in MOCASIN 
had to be relocated to shared memory, to enable correct 
operation of the parallel version. Conflicts can arise
when multiple mover processes try to simultaneously
modify data values in that part of main memory which
is shared by all of the processes. These conflicts are
resolved by using locks.

Some additional changes were required, to insure 
that the parallelized version of MOCASIN 2.0 would 
give reproducible results for different runs that used the 
same random number seed. There were two such
changes. The first was to associate an individual random
seed with each electron being simulated. With this
change, each particle was guaranteed to be moved in
the same way, regardless of the temporal ordering of
events in successive runs of the program. This is similar
to the technique reported by Fredrickson et al. [3].
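
The effect of giving each particle its own random stream can be illustrated with a short sketch. The function names, the use of Python's `random.Random`, and the way per-particle seeds are derived from a master seed are all assumptions made for the illustration; the source does not say how MOCASIN derived its seeds.

```python
import random


def make_particle_rngs(master_seed, n_particles):
    """Derive one private generator per particle from a single master seed."""
    seeder = random.Random(master_seed)
    return [random.Random(seeder.getrandbits(64)) for _ in range(n_particles)]


def move_particles(master_seed, n_particles, order):
    """Draw one random number per particle, visiting particles in `order`."""
    rngs = make_particle_rngs(master_seed, n_particles)
    draws = [None] * n_particles
    for i in order:
        draws[i] = rngs[i].random()   # depends only on particle i's own stream
    return draws


def move_particles_shared(master_seed, n_particles, order):
    """Same, but with one shared stream, as in the serial version."""
    rng = random.Random(master_seed)
    draws = [None] * n_particles
    for i in order:
        draws[i] = rng.random()       # depends on when particle i is visited
    return draws
```

With private streams the draws are the same for any visiting order, whereas the shared stream ties each particle's draw to the order in which the particles happen to be scheduled.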


MAIN:
    CREATE MOVER PROCESSES; (one per processor)
    FOR EACH TIME STEP {
        DETERMINE ELECTRIC FIELD;
        MOVE_PARTICLE PROCEDURE {
            FOR EACH PARTICLE
                SEND MESSAGE (particle index);
            WAIT FOR COMPLETION
                OF ALL PARTICLE MOVES;
        }
    }

MOVER PROCESS:
    DO FOREVER {
        RECEIVE MESSAGE (particle index);
        PERFORM ONE MOVE STEP;
        SIGNAL COMPLETION OF ONE PARTICLE;
    }


Figure 2 
Structure of Parallel Version of MOCASIN 2.0. 
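
The structure in Figure 2 can be sketched in Python as follows. This is a hedged illustration, not PPL code: `queue.Queue` plays the role of the mailbox, its `task_done`/`join` pair stands in for the counter, threads stand in for mover processes, and the one-line position update is a placeholder for a real move step. The `None` sentinel exists only so the sketch can terminate.

```python
import queue
import threading


def mover(mailbox, positions, velocities, dt):
    """Mover process: receive a particle index, move it, signal completion."""
    while True:
        i = mailbox.get()                    # suspends while the mailbox is empty
        if i is None:                        # sentinel used only to end the sketch
            mailbox.task_done()
            return
        positions[i] += velocities[i] * dt   # placeholder for one move step
        mailbox.task_done()                  # signal completion of one particle


def simulate(positions, velocities, dt, n_steps, n_movers):
    """Run n_steps time steps with n_movers mover threads."""
    mailbox = queue.Queue()
    workers = [threading.Thread(target=mover,
                                args=(mailbox, positions, velocities, dt))
               for _ in range(n_movers)]
    for w in workers:
        w.start()
    for _ in range(n_steps):
        # (the electric field would be determined here, serially)
        for i in range(len(positions)):
            mailbox.put(i)                   # one message per particle
        mailbox.join()                       # wait for all particle moves
    for _ in workers:
        mailbox.put(None)
    for w in workers:
        w.join()
    return positions
```

Because each index is delivered to exactly one mover per time step, no two movers update the same particle, so this sketch needs no lock around the position update.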


The second change was to modify the procedure 
used to delete particles that left the simulation region 
through a contact. In the serial version, each particle 
was deleted from the array of active particles and 
replaced with a particle from the end of the array, 
which kept the array in a compact form. When execut- 
ing in parallel, this technique would have meant that 
the order of the particles in the resulting array would 
have been different each time the parallel version was 
executed. The parallel version was changed so that as a 
particle was deleted from the region, its index was 
added to a list. At the end of the moving phase, this 
list was sorted, the particles then deleted in order, and 
the ‘‘holes’’ in the array filled as before. In addition to 
preserving the order of the particles in the array, this 
technique was required for two other reasons: (1) using 
the list allowed the program to postpone deleting parti- 
cles until the end of the transport phase so that the size 
and structure of the array did not change during the 
parallel phase of execution, and (2) sorting the list of 
deleted particles meant that they would be removed in 
descending order by indices and the delete procedure 
from the serial version could be reused. 
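
A minimal sketch of the deferred deletion scheme, assuming the particle array is a Python list and that the serial delete fills the hole with the last element; the function names are invented for the illustration.

```python
def delete_particle(particles, index):
    """Serial-style delete: fill the hole with the particle from the end."""
    last = particles.pop()
    if index < len(particles):
        particles[index] = last


def apply_deferred_deletions(particles, deleted_indices):
    """Apply the deletions recorded during the parallel phase.

    Sorting in descending order means every index above the current one has
    already been removed, so the particle taken from the end is a survivor.
    """
    for index in sorted(deleted_indices, reverse=True):
        delete_particle(particles, index)
    return particles
```

The final array depends only on which indices were recorded, not on the order in which the movers recorded them.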


Timing Results 


Correct operation of the parallel version of 
MOCASIN 2.0 was verified by comparing the results 
with those from the original serial version. The new 
results were not identical, because of the different ordering
of movements caused by the new procedure for generating
random numbers. However, these results were
judged to be valid, based on a thorough analysis of
them. Next, performance of the parallel version was
measured using N processors, with N ranging from 1
to 10. These tests simulated the same MESFET with 
two different sets of parameters [2], as follows: 

• A short set of parameters, with 100 electrons, for
10 time steps, and

• A long set of parameters, with 2000 electrons, for
100 time steps.

As mentioned earlier, the system used in the project
was the Sequent Balance 8000 at MCC. This system
has 12 CPUs; each CPU has an NS32032 (10 MHz)
processor, a floating point coprocessor, a memory
management unit, and 8 kilobytes of local cache. The
size of the main memory is 16 megabytes, and all of
this memory can be accessed by any processor. The
local caches employ a write-through strategy, and special
circuitry insures that the contents of each cache
remain consistent, even when cached data is being
updated by another processor.


Some results from the experiments on the Sequent 
are shown in Table 1. The speed-up factor in Table 1
is the elapsed time with N = 1 processor divided by
the elapsed time for each other value of N.


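[Table 1 survives only in part. Its columns were the number of processors and, for each of the short and long problems, the elapsed time and the speed-up factor. The recoverable long-problem elapsed times are 171,422 (N = 1, speed-up 1.0), 85,310, 43,334, 21,943, and 17,512; the short-problem speed-up for N = 1 is 1.0.]
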

Table 1 
Elapsed Times (Seconds) and Speed-up Factors on Sequent 
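
As a worked example of the speed-up definition, the sketch below divides the first long-problem elapsed time in Table 1 (the N = 1 run) by each of the long-problem times listed there. The processor counts for the later rows are not assumed, so only the ratios are computed; the function name is illustrative.

```python
def speedup_factors(elapsed_times):
    """Speed-up of each run relative to the first (single-processor) run."""
    baseline = elapsed_times[0]
    return [baseline / t for t in elapsed_times]


# Long-problem elapsed times (seconds) as listed in Table 1.
long_problem_times = [171422, 85310, 43334, 21943, 17512]
factors = speedup_factors(long_problem_times)
rounded = [round(f, 2) for f in factors]
```

The last ratio, about 9.79, is in line with the roughly 9.7-fold speed-up on ten processors quoted in the Summary.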


A common way of evaluating parallel speed up is 
to compare the elapsed times of the parallel version to 
the elapsed time of the purely serial version. When this 
was done with the two versions of MOCASIN 2.0, the 
serial version required more time than the parallel ver- 
sion with one processor (by about 10 to 20 per cent). 
This result was counterintuitive and more studies were 
performed, to try to understand the cause. Additional 
analysis pointed out that the user mode times for the 
serial version and the parallel version were about the 
same, but that the system mode time for the serial ver- 
sion was much greater than for the parallel version. 
Some detailed runtime statistics showed that the
number of page reclaims (page faults which were
satisfied without doing any I/O) for the serial version
was between 5 and 10 times greater than
the number of page reclaims in the parallel version. In
addition to introducing parallel processes, the parallel
version also used shared memory, while the serial version
used unshared memory. The data in Table 2 summarize
these findings. It is fair to conclude that in the
version of the Sequent operating system being used for 
these tests, page replacement is biased in favor of keep- 
ing pages in shared memory over keeping pages in 
unshared memory, to such a degree that comparisons of 
running times for programs using shared memory and 
programs using nonshared memory can be invalid. 


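[Table 2 is only partly legible. It reported times in seconds and page faults for the short problem; the surviving entries, without their row and column labels, are: 991, 963, 1010, 1111, 1323, 1434; 24, 20, 28, 39, 49; 15,767, 14,769, 18,931, 18,327, 16,861; 023, 294, 181, 159.]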


Table 2 
Times (Seconds) and Page Faults for Short Problem 


Summary 


The goals of the project were met. First, a C program
written for a single processor system was quickly
(about two man weeks) converted to run on a multiprocessor
system. Second, measurements made on the
parallel version showed that this program could make
very efficient use of a multiprocessor system. The fact
that the program, executing the large model, ran 9.7
times faster on a 10 processor system demonstrates
efficient performance.


The project, while achieving the stated goals, also 
raises some additional issues. For example, how critical 
is the fact that in this problem, the particle movements 
were all independent from each other (in the same time 
step)? Could an equally efficient implementation be 
done, even when there are some dependencies between 
particles as they are moved? Another issue involves the 
style of the parallel version of MOCASIN 2.0. The 
current parallel version uses the simple technique of 


sending a message for each particle which needs to be 
moved. There are other techniques, such as assigning 
blocks of particles to each of the mover processes at 
each time step, which would eliminate the need for 
sending 2000 messages (in the long problem) at each 
time step. Also, there are probably other approaches 
which would eliminate the sort (of the deleted particles) 
at the end of each time step while maintaining a repro- 
ducible set of results. Such a change could reduce the 
CPU time required at each step. 


Finally, there was no attempt to extrapolate the 
performance efficiencies which would be obtained with 
increasing numbers of processors. The current project 
was limited to ten processors by the system being used. 
It would be interesting to continue the study by 
increasing the number of processors. In the extreme 
case, how would a 2000 processor system handle this
problem? How does efficiency ‘‘tail off’’ as the number
of processors increases? 


The current project was important in one other 
respect, namely it provided a concrete example of a use- 
ful parallel program. The practicality of parallel pro- 
gramming, to solve scientifically interesting problems, 
has not yet been generally established. This project is 
another step in the process of discovering both the use- 
fulness and the limitations of parallel processing. 
Abstract


In this paper we discuss the issues involved in im- 
plementing a general dynamic program on a data paral- 
lel computer to compare proteins for good subsequence 
matches, based on a variety of scoring metrics. A stan- 
dard serial algorithm can be optimally parallelized. Care- 
ful allocation of machine resources has enabled us to 
compare an entire database of 2000 proteins against it- 
self in about the same time that it would take to run 
one protein against the database using conventional com- 
puters. The results gleaned from this program provide 
information about scoring metrics and allow clustering 
of groups of related proteins. This information can be 
of assistance in determining the biochemical function of 
some proteins. 
Introduction


In many fields, including molecular biology, speech recognition,
and cryptology, there are problems whose solutions involve
comparing sequences of symbols to determine correlations
between them. Often, this task involves finding an optimal 
alignment of the sequences in question, which may require in- 
troducing gaps into one or both of these sequences. One way 
of measuring the optimality of an alignment is by computing 
a score based on a matrix of weights reflecting the similarity 
between pairs of symbols. In some situations a penalty is sub- 
tracted for each gap introduced. Such a score can be computed 
by a dynamic programming algorithm in time proportional to 
the product of the lengths of the sequences [1,12]. 


In biology, this technique is particularly useful. Although 
it is possible to determine the amino acid sequence for virtu- 
ally any protein, there is no general method for determining 
the protein’s biochemical function from this information alone. 
One of the most successful approaches has been to find a similarity
between a subsequence of a newly sequenced protein and
one of a protein of known function [5]. However, because comparison
of a single protein to a database of 4,000-5,000 proteins
using the complete dynamic programming algorithm can take
several hours on conventional computers, biologists frequently
employ more approximate, heuristic methods [11]. Also, because
of the computational difficulty of performing exhaustive
studies of many alternative scoring systems, the few scoring
systems that are typically used are relatively ad hoc. Only a
few computationally intensive studies, using a single scoring
system, have been done [3,14].


In order to examine the methodology underlying these scor- 
ing systems more completely and to initiate a systematic search 
through the databases for statistically significant correlations 
between pairs or groups of proteins, we have implemented a 
parallel version of this dynamic programming algorithm on the 
data parallel CM-2 Connection Machine® system. Through an 
analysis of the data layout and careful coding of the inner loop 
we have written a program which enables us to run every pro- 
tein in a database of 2000 proteins against the entire database 
in under an hour. 


In this paper, we discuss the implementation issues involved
in this application. We also mention preliminary results from 
our systematic investigation of some scoring systems. Section 
2 discusses the architecture of the CM-2 Connection Machine 
system, Section 3 reviews the basic dynamic programming al- 
gorithm for subsequence comparison, and Section 4 indicates 
how it can easily be parallelized. Section 5 contains a detailed 
description of the implementation on the CM-2, Section 6 dis- 
cusses generalizations of the algorithm in Section 3, and Section 
7 contains timings and a brief description of some of the results 
obtained using this software. 


2 The CM-2 Architecture 


The algorithms for this project were programmed on the Con- 
nection Machine CM-2 data parallel computer [4,7]. The CM- 
2 is composed of a microsequencer and a maximum of 64K 
single-bit processing elements. The processors run in SIMD 
mode, with the instruction stream broadcast by the sequencer. 
The sequencer is controlled by an external front end machine,
usually a SYMBOLICS® Lisp Machine or a VAX®. Parallel
extensions of familiar programming languages allow the user
to program the front end to do serial computation, with an
expanded instruction set providing access to the “data parallel”
part of the system. Each processor is associated with 64K
bits of RAM, and there is a single high-speed floating point
unit for every 32 processors. The processors are connected in a
boolean n-cube topology; however, the communication system
is general enough that an arbitrary connection scheme can be
imposed by the user. Thus, data can be rapidly exchanged
between the memories of different processors as necessary to
complete any computation. One very communication-efficient
way to configure the machine is as a k-dimensional grid, which
is automatically superimposed by the system software onto the
boolean cube using a multidimensional Gray-code mapping. In
this paper we will deal with the CM-2 as if the processors were
configured in a 1-d grid. In this situation every processor has a
unique identifying coordinate denoted by p, such that processor
p only communicates with processors p − 1 and p + 1, where p
ranges from 0 to N − 1 and N is the number of processors
in the CM-2.


3 The Subsequence Matching Problem 


The subsequence problem can be formulated as follows:

Given two sequences A, B of symbols chosen from a domain F,

$$A = (a_1, a_2, \ldots, a_n), \quad B = (b_1, b_2, \ldots, b_m), \quad a_i, b_j \in F,$$

find the subsequences

$$A' = (a_{i_1}, a_{i_2}, \ldots, a_{i_z}), \quad B' = (b_{j_1}, b_{j_2}, \ldots, b_{j_z})$$

(where $1 \le i_1 < i_2 < \cdots < i_z \le n$ and $1 \le j_1 < j_2 < \cdots < j_z \le m$) which
maximize the comparison function $C(A', B')$. C can depend on
the symbols $a_{i_k}$, $b_{j_k}$ in A' and B' and on the numbers of symbols
in A and B which are omitted between successive symbols in
A' and B' (gaps).


We denote by $\sigma_l = i_{l+1} - i_l - 1$, $\tau_k = j_{k+1} - j_k - 1$ the
“gap” sizes between $a_{i_l}$ and $a_{i_{l+1}}$, $b_{j_k}$ and $b_{j_{k+1}}$ respectively, for
$1 \le l, k < z$. Let $A' \prec A$ denote that A' is a subsequence of A,
and define $A_{(j)}$ to be the subsequence of the first j elements of
A, i.e., $A_{(j)} = (a_1, \ldots, a_j)$.


In this paper, we consider primarily scoring systems which 
are defined recursively as follows: 
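
$$\begin{aligned}
C(A', B') &= C(A'_{(z)}, B'_{(z)}) \\
&= C(A'_{(z-1)}, B'_{(z-1)}) + D(a_{i_z}, b_{j_z}) + g \cdot (\sigma_{z-1} + \tau_{z-1}) \\
&= \sum_{k=1}^{z} D(a_{i_k}, b_{j_k}) + g \cdot (i_z - i_1 - z + j_z - j_1 - z + 2),
\end{aligned}$$

where the gap constant g < 0, and D is a correlation function
between single elements of F.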


For such comparison functions, one can use a dynamic pro- 
gramming algorithm to determine the best subsequence match 
for a given pair of sequences A, B in serial time O(mn) where 
n and m are the lengths of the sequences A and B. 


This dynamic programming algorithm can best be understood
by considering the matrix

$$C_{r,s} = \max\{\, C(A', B') + g \cdot (r - i_z + s - j_z) \mid i_z \le r,\ j_z \le s \,\},$$

where the max is taken across all $A' = (a_{i_1}, \ldots, a_{i_z})$, $B' = (b_{j_1}, \ldots, b_{j_z})$.
It is then clear [13] that

$$C_{r,s} = \max\{\, 0,\ C_{r-1,s-1} + D(a_r, b_s),\ C_{r-1,s} + g,\ C_{r,s-1} + g \,\}.$$

Thus, to compute $\max_{A',B'}\{C(A', B') \mid A' \prec A,\ B' \prec B\}$, one
need only compute $\max_{r,s} C_{r,s}$, where the matrix elements $C_{r,s}$
can be computed inductively in time O(mn).
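
The recurrence can be written out directly as a serial routine. This is a minimal sketch: the function names and the toy correlation function (2 for a match, −1 for a mismatch) are assumptions for illustration, not the scoring systems studied in the paper.

```python
def best_local_score(a, b, D, g):
    """Return max over r, s of C[r][s] for sequences a and b.

    D(x, y) is the correlation between two symbols and g < 0 is the gap
    constant. Row 0 and column 0 of C are zero.
    """
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for r in range(1, n + 1):
        for s in range(1, m + 1):
            C[r][s] = max(0,
                          C[r - 1][s - 1] + D(a[r - 1], b[s - 1]),
                          C[r - 1][s] + g,
                          C[r][s - 1] + g)
            best = max(best, C[r][s])
    return best


def match_score(x, y):
    """Toy correlation function: reward matches, penalize mismatches."""
    return 2 if x == y else -1
```

For example, `best_local_score("GATTACA", "ATTAC", match_score, -2)` is 10, the score of the five-symbol common run `ATTAC`. The two nested loops make the O(mn) cost explicit.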


4 Parallel Implementation 


A parallel version of the dynamic programming algorithm is
quite straightforward to derive (see, for example, [6]). Since computing
the value of $C_{r,s}$ only depends on knowing the values
of $C_{r-1,s}$, $C_{r,s-1}$, and $C_{r-1,s-1}$, we see that all of the elements
on one anti-diagonal of the matrix can be computed simultaneously
if the values along the two previous anti-diagonals are
known. That is, for a fixed value of t, the matrix elements
$C_{t-s,s}$ can be computed simultaneously for all s provided that
they are known for t − 1 and t − 2. Thus, one can parallelize
the above algorithm by computing successive anti-diagonals of
the matrix $C_{r,s}$ on successive time steps. This is represented
schematically in Figure 1. The algorithm requires n + m − 1
time steps and m processors to compare proteins of length m
and n. If the proteins are ordered so that m ≤ n, then we
obtain optimal speedup since O(m)O(n) = O(mn).


[Figure 1 axes: processor number and time step, with time steps running to n + m − 1.]

Figure 1: Diagram indicating activity of processor p at time step t. If
$1 \le t - p \le n$, then processor p computes $C_{t-p,\,p+1}$ at step t. Otherwise, the
processor is inactive.
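
The anti-diagonal ordering can be checked with a serial emulation. In this sketch every cell on anti-diagonal t (r + s = t) is computed from a snapshot of earlier diagonals before any of them is written back, mimicking simultaneous evaluation. The function names and the toy scoring function are assumptions.

```python
def wavefront_matrix(a, b, D, g):
    """Fill C one anti-diagonal per step; return the matrix and step count."""
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    steps = 0
    for t in range(2, n + m + 1):                # anti-diagonal r + s = t
        cells = [(t - s, s) for s in range(max(1, t - n), min(m, t - 1) + 1)]
        # Each cell reads only the two previous anti-diagonals, so these
        # values could all be computed at the same time.
        values = [max(0,
                      C[r - 1][s - 1] + D(a[r - 1], b[s - 1]),
                      C[r - 1][s] + g,
                      C[r][s - 1] + g)
                  for r, s in cells]
        for (r, s), v in zip(cells, values):
            C[r][s] = v
        steps += 1
    return C, steps


def row_order_matrix(a, b, D, g):
    """Reference: fill the same matrix row by row."""
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(1, n + 1):
        for s in range(1, m + 1):
            C[r][s] = max(0, C[r - 1][s - 1] + D(a[r - 1], b[s - 1]),
                          C[r - 1][s] + g, C[r][s - 1] + g)
    return C


def toy_score(x, y):
    """Toy correlation function used only for this illustration."""
    return 2 if x == y else -1
```

Both orderings produce the same matrix, and the wavefront version takes exactly n + m − 1 steps.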


5 CM-2 Implementation 


The basic goal was to develop software which would calculate 
the scores between every pair of proteins in a given database 
according to an arbitrary scoring metric of the kind described 
above. Since many individual pairwise comparisons have to be 
done for a single database search, it is worthwhile to efficiently 
pack the data so that the overall time spent is minimized. (Note 
that here F is the set of all amino acids with several extra 
symbols added to denote unknowns. Since this domain consists 
of only 23 distinct symbols, we can represent each as a five bit 
integer, ranging from 1 to 23.) 


5.1 Data Mapping 


Since the CM-2 has far more processors than most proteins have 
amino acids, it is possible to perform the dynamic programming 
algorithm on many proteins at the same time. To this end, the 
proteins in the chosen database are first sorted by length and
then partitioned sequentially into sets $S_d$ such that the total
number of amino acids in the proteins of each $S_d$ is as large
as possible but is less than or equal to the total number of
processors available in the CM-2. The proteins in each set
$S_d$ are then compared in parallel to the entire database, one
protein at a time.


We denote the nth smallest protein in the database by $P_n$,
and its length by $l(P_n)$. If the CM-2 we are working on has N
processors, then we set $S_d = \{P_{a_{d-1}}, \ldots, P_{a_d - 1}\}$, where $a_0 = 1$,
and

$$\sum_{a = a_{d-1}}^{a_d - 1} l(P_a) \le N < \sum_{a = a_{d-1}}^{a_d} l(P_a)$$

is used to determine $a_d$ inductively from $a_{d-1}$.

We will denote the ith protein in $S_d$ by $P_i^d$, i.e.,

$$P_i^d = P_{a_{d-1} + i - 1},$$

and the qth amino acid in this protein by $P_i^d(q)$.


If M is the total number of such sets $S_d$, and K is the total
number of proteins in the database, then the basic structure of
the serial part of the program to compare all pairs of proteins
is:

for d = 1 to M do
    set up S_d on CM-2
    for k = 1 to K do
        compare in parallel all proteins in S_d to P_k
        retrieve results and store on front end

where setting up $S_d$ involves laying out the amino acids in the
proteins in $S_d$ in linear fashion into the memory of the processors
with the CM-2 configured as a 1-d grid as described in
Section 2 [see Figure 2]; the inner loop which compares $S_d$ to
$P_k$ is implemented using the parallelized dynamic programming
algorithm just described. Thus, on each pass through the inner
loop, all the proteins in $S_d$ will be compared simultaneously
with some protein $P_k$, which we will designate henceforth as a
“target” protein. We will use $p_i^d$ to denote the coordinate of
the processor containing $P_i^d(1)$. Thus,

$$p_i^d = \sum_{j=1}^{i-1} l(P_j^d).$$

We make the following comments:

1. Since in practice the correlation function D is usually
symmetric, in order to compare all pairs of proteins in
the database it is only necessary to compare the proteins
in $S_d$ to the proteins in all $S_{d'}$ where $d' \ge d$. This is more
efficient than using the proteins in $S_{d''}$ where $d'' < d$ as
target proteins since this uses the longer proteins as target
proteins as recommended in Section 4.

2. It is preferable not to have large variations in the lengths
of the proteins in a set $S_d$. Observe that the time needed
to compare $S_d$ to a protein P is

$$T(d, P) = l(P) + \max\{l(P') \mid P' \in S_d\} - 1 = l(P) + l(P^d_{\mathrm{last}}) - 1.$$

This was the motivation for sorting the database before
partitioning into sets $S_d$. Even though this does not
necessarily yield an optimal packing of the proteins into
groups of bounded size, in most situations it seems to do
well at optimizing $\sum T(d, P)$, which is the value we really
want to minimize. This heuristic is related to the First
Fit Decreasing algorithm for bin packing, which can be
shown to come within a factor of 11/9 of the minimal
number of $S_d$ needed [9].

3. From Figure 1, it is clear that processors are idle some
fraction (less than half) of the time. One might try to reduce
the number of cycles in which processors are idle by
overlapping comparisons with consecutive target proteins.
Although this might give a slight performance improvement,
the increased complexity of the inner loop code
makes it unclear what, if any, gain would be made.


[Figure 2 axis: processors.]

Figure 2: Diagram indicating mapping of amino acids in proteins in
$S_d$ ($P_1^d, P_2^d, \ldots$) onto CM processors.


5.2 Inner loop implementation 


In this section we will describe the actual Connection Machine
implementation of the parallel algorithm for the inner loop of
our code. We assume that the proteins in some set $S_d$ have
been put into the CM-2's memory according to the scheme
described in the previous section. For a given target protein $P_k$,
at each iteration step t the value of $C_{t-q,q}(P_k, P_i^d)$ is computed
by processor $p = p_i^d + q - 1$. We will denote a parallel variable
foo by foo[], where foo[p] is the value of the variable foo[] in
processor p. $\mathrm{foo}_t[p]$ is the value of foo[p] at the end of iteration
t. Thus, we can describe the data from $S_d$ as being stored in a
parallel variable (also called a “pvar”) SD[], so that

$$\mathrm{SD}[p_i^d + q - 1] = P_i^d(q),$$

recalling that $P_i^d(q)$ is the qth amino acid in $P_i^d$ in its five bit
integer form. At the time of initialization of SD[] for $S_d$ we
also initialize a boolean pvar H[] marking the beginning of each
protein, i.e.,

$$H[p] = \begin{cases} 1 & \text{if processor } p \text{ contains } P_i^d(1) \text{ for some } i, \\ 0 & \text{otherwise.} \end{cases}$$

The target protein $P_k$ is stored in the memory of the front end
machine. The amino acids from the target protein will be sent
from the front end to the Connection Machine memories one
per time step. They will shift across a pvar TGT, and thus be
aligned with the amino acid sequences held in the pvar SD, so
that the comparisons can be performed [see Figure 3].

[Figure 3 axis: processors, with the pvars laid out along it.]

Figure 3: Data layout for the first three iterations of the inner loop for a set of
proteins in a typical $S_d$, where length($P_1^d$) = 4.


We will make use of the following pvars in the code:


SD holds the amino acid sequences for the proteins
in $S_d$ as described above.


H indicates the beginning of each protein in SD.


CNOW$_t$ gets the values of the C matrix on the anti-
diagonal currently being computed, i.e., $C_{t-q,q}$.
(It starts with the values from the previous row
of C, $C_{t-q-1,q}$.)


CLAST$_t$ gets the values of the C matrix entries from
the previous column of C, i.e.,
$C_{(t-1)-(q-1),q-1} = C_{t-q,q-1}$.


CDIAG$_t$ gets the values of the C matrix entries from
the 2nd previous anti-diagonal of C, i.e.,
$C_{(t-2)-(q-1),q-1} = C_{t-q-1,q-1}$.


CMAX$_t$ gets the cumulative maximum score for the
corresponding column of the C matrix, i.e.,
$\max_{t' \le t} C_{t'-q,q}$.


TGT holds the shifting copies of the amino acid se-
quence for $P_k$.


The code for the inner loop now proceeds as follows (the numbers
at the right refer to the algorithm notes below):


/* initialize pvars */
CLAST[p] ← 0                                              (1)
CNOW[p] ← 0
CMAX[p] ← 0
TGT[p] ← 0
/* compute score matrix C */
for t = 1 to T(d, P_k) do
    TGT[p] ← TGT[p − 1]                                   (2)
    where H[p] = 1,                                       (3)
        TGT[p] ← P_k(t)                                   (4)
    CDIAG[p] ← CLAST[p] + D(SD[p], TGT[p])                (5)
    CLAST[p] ← CNOW[p − 1]                                (2)
    CNOW[p] ← max{0, CLAST[p] + g, CNOW[p] + g,
                  CDIAG[p] + D(SD[p], TGT[p])}
    CMAX[p] ← max{CMAX[p], CNOW[p]}
CMAX[] ← max-scan(CMAX[]), with segmentation pvar H[]     (6)


At the completion of the algorithm, we assert that:

$$\mathrm{CMAX}[p_{i+1}^d - 1] = \max\{C(A', B') \mid A' \prec P_i^d,\ B' \prec P_k\},$$


so the results of the comparison can be directly retrieved from 
CM-2 memory and stored on the front end for later use. 
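

The data movement of the inner loop can be checked with a serial simulation. The Python sketch below is only an illustration, not the CM-2 code: it handles a single stored protein (so no segmentation pvar is needed), represents each pvar as a list indexed by processor, and adds the substitution score once, when CNOW is formed, so that CDIAG holds the bare matrix entry described in the pvar list. The function names and the +1/-1 score are hypothetical, and the gap penalty is assumed to be non-positive.

```python
def unit_score(a, b):
    """Illustrative substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def antidiagonal_best_score(stored, target, score, gap):
    """Serial simulation of the shifting-target inner loop for one stored protein.

    stored, target: non-empty lists of positive integer codes (0 is padding).
    gap: the gap penalty g, assumed <= 0.
    """
    n_proc = len(stored)
    clast = [0] * n_proc
    cnow = [0] * n_proc
    cmax = [0] * n_proc
    tgt = [0] * n_proc
    for t in range(1, len(target) + n_proc):
        # Shift the target one processor to the right and feed the next
        # residue (or padding once the target is exhausted) at processor 0.
        incoming = target[t - 1] if t <= len(target) else 0
        tgt = [incoming] + tgt[:-1]
        cdiag = clast                # entry from the 2nd previous anti-diagonal
        clast = [0] + cnow[:-1]      # left neighbour's value from the last step
        # D(0, a) = D(a, 0) = 0, so inactive processors need no special case.
        d = [0 if (a == 0 or b == 0) else score(a, b)
             for a, b in zip(stored, tgt)]
        cnow = [max(0, clast[p] + gap, cnow[p] + gap, cdiag[p] + d[p])
                for p in range(n_proc)]
        cmax = [max(m, c) for m, c in zip(cmax, cnow)]
    return max(cmax)


def reference_best_score(stored, target, score, gap):
    """Row-by-row dynamic program used only to cross-check the simulation."""
    rows, cols = len(target), len(stored)
    c = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for r in range(1, rows + 1):
        for s in range(1, cols + 1):
            c[r][s] = max(0,
                          c[r - 1][s - 1] + score(stored[s - 1], target[r - 1]),
                          c[r][s - 1] + gap,
                          c[r - 1][s] + gap)
            best = max(best, c[r][s])
    return best
```

Comparing the two functions on a few short sequences is a quick way to confirm that the three shifted pvars line up with the left, upper and diagonal neighbours of the score matrix.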


Algorithm notes:


1. All assignment statements are assumed to be carried out
in all processors, unless otherwise indicated. Thus foo[p] ← 0
means that the pvar foo[] is given a value of 0 simulta-
neously in all processors.


2. Assignment statements involving pvar data in different
processors are accomplished using 1-D grid communica-
tions.


3. The where command evaluates some expression in every
processor simultaneously, and only carries out the conse-
quent operation in those processors in which this logical
expression is true.


4. We assume that $P_k(t) = 0$ for $t > l(P_k)$.


5. The evaluation of the function D(SD[p], TGT[p]) was im-
plemented using an indirect table lookup, where the 5-bit
bytes SD[p] and TGT[p] are concatenated into a 10-bit
word used as an offset into a locally stored table. Setting
$D(0, a) = D(a, 0) = 0$ for all $a$ ensured that the values of
C computed in processors during "inactive" time steps
were less than the maximum valid value of C computed
in the same processor during the same comparison. Thus,
it was not necessary to explicitly determine which proces-
sors were active at each time step.


6. The scan operation is a primitive used in many parallel
algorithms [2,8,10]. Scans can be defined as follows: given
an associative operator $\oplus$ ($(a \oplus b) \oplus c = a \oplus (b \oplus c)$), the
function $\oplus$-scan(f[]) returns a pvar z[] whose value at
processor p is given by


$$z[p] = f[1] \oplus f[2] \oplus \cdots \oplus f[p].$$


So-called "segmentation" bits are boolean pvars used to
break up the scan into disjoint segments, over which the
scan is independently performed. For example,


z[p] ← $\oplus$-scan(f[]) with segmentation pvar h[]


means


$$z[p] \leftarrow f[p_0] \oplus f[p_0 + 1] \oplus \cdots \oplus f[p], \qquad p_0 = \max\{p' \mid p' \le p,\ h[p'] = 1\}.$$

Thus, line (6) of the above pseudocode results in:


$$\mathrm{CMAX}[p_i^d + q] = \max_{p' \le q} \{\mathrm{CMAX}[p_i^d + p']\}$$


from which it can be seen that the above assertion about
CMAX follows.
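

The segmented scan in note 6 can be written serially in a few lines. This is a minimal sketch of the semantics only (a real scan primitive runs in logarithmic parallel time); the function name is hypothetical, and a position before the first segmentation bit is treated as belonging to a segment that starts at the first element.

```python
def segmented_max_scan(values, seg_start):
    """Inclusive max-scan that restarts wherever seg_start is true."""
    out = []
    running = None
    for v, h in zip(values, seg_start):
        # A set segmentation bit begins a new, independent segment.
        running = v if (h or running is None) else max(running, v)
        out.append(running)
    return out
```

With values `[3, 1, 4, 1, 5, 9, 2, 6]` and segment starts at positions 0, 3 and 6, the last element of each segment holds that segment's maximum, which is the property used to read the best score of each protein from its final processor.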


6 Generalizations


In many situations one wishes to know more than just the 
score of the best subsequence matches. For instance, it is of- 
ten useful to know where along the protein these matches oc- 
cur or how many gaps are inserted into them. The algorithm 
described above can be modified to retain such information 
without suffering a significant decrease in performance. As an 
illustration, we sketch the modified algorithm used to deter- 
mine the locations of the initial and final amino acids of the 
best subsequences for each comparison, as well as the scores. 


Assume that for every comparison of proteins A and B, we
wish to compute not only $\max\{C(A', B') \mid A' \prec A,\ B' \prec B\}$,
but also the positions of the first and last amino acids in the
subsequences $A'$, $B'$. That is, if $A' = (a_{i_1}, a_{i_2}, \ldots, a_{i_z})$,
$B' = (b_{j_1}, b_{j_2}, \ldots, b_{j_z})$ maximize $C(A', B')$, we want the dynamic pro-
gramming algorithm to give us the values of $C(A', B')$, $i_1$, $i_z$,
$j_1$, and $j_z$. The algorithm can be modified to do this by asso-
ciating with each matrix entry $C_{r,s}$ the values of $i_1$ and $j_1$ for
the subsequences ending at r, s which yield the maximum score
for $C_{r,s}$. These can be stored in matrices $I_{r,s}$, $J_{r,s}$, and when
the values of $C_{r,s}$ are computed inductively, the values of $I_{r,s}$
and $J_{r,s}$ can be simply carried over from the previous values of
the appropriate C matrix entries. If we introduce the notation
$A|B$ to denote the concatenation of two variables, and define


$$\max\{A|B,\ C|D\} = \begin{cases} A|B, & A > C \text{ or } (A = C \text{ and } B > D) \\ C|D, & \text{otherwise} \end{cases}$$


then our new recursion relation can be defined as:


$$C_{r,s}|I_{r,s}|J_{r,s} = \max \begin{cases} 0\,|\,r\,|\,s \\ (C_{r-1,s-1} + D(a_r, b_s))\,|\,I_{r-1,s-1}\,|\,J_{r-1,s-1} \\ (C_{r,s-1} + g)\,|\,I_{r,s-1}\,|\,J_{r,s-1} \\ (C_{r-1,s} + g)\,|\,I_{r-1,s}\,|\,J_{r-1,s} \end{cases}$$


And the values of $C$, $i_1$, $j_1$, $i_z$, and $j_z$ for the optimal subse-
quence are given by:


$$C\,|\,i_1\,|\,j_1\,|\,i_z\,|\,j_z = \max_{r,s}\{C_{r,s}\,|\,I_{r,s}\,|\,J_{r,s}\,|\,r\,|\,s\}.$$
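

One way to read the concatenation $A|B$ is as bit-field packing: if the score occupies the most significant bits of a machine word, a single integer max reproduces the lexicographic max defined above. The following Python sketch illustrates that reading; the field width of 8 bits and the helper names are assumptions made for the example, and all fields are assumed non-negative (which holds for the scores because of the 0 term in the recursion).

```python
def pack(fields, bits):
    """Concatenate non-negative integer fields, most significant first."""
    word = 0
    for f in fields:
        if not 0 <= f < (1 << bits):
            raise ValueError("field does not fit in the chosen width")
        word = (word << bits) | f
    return word


def unpack(word, count, bits):
    """Split a packed word back into `count` fields of `bits` bits each."""
    mask = (1 << bits) - 1
    return tuple((word >> (bits * (count - 1 - k))) & mask
                 for k in range(count))


# Three candidate (score, I, J) triples; the integer max picks the highest
# score and breaks the tie on the second field, as in the definition of max.
candidates = [(5, 2, 7), (5, 3, 1), (4, 9, 9)]
best = unpack(max(pack(c, 8) for c in candidates), 3, 8)
```

Here `best` equals `(5, 3, 1)`, the same triple that Python's own lexicographic tuple comparison selects.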


It is fairly easy to modify the code described in Section 5.2 
to implement this algorithm. One could use a similar approach 
to retain other information about the best subsequences, such 
as the number of times a particular pair of amino acids ap-
pear as $a_{i_n}$, $b_{j_n}$ for any n, or the values of z for each sequence
comparison.


In fact, one can also generalize the dynamic programming
algorithm to deal with a broader class of comparison functions
C. For instance, one might wish to use a scoring function of
the form:


$$C(A', B') = \sum_{k=1}^{z} D(a_{i_k}, b_{j_k}) + \sum_{k=1}^{z-1} \left[ g\,(\sigma_k + \tau_k) + \chi(\sigma_k) + \chi(\tau_k) \right]$$


where

$$\chi(n) = \begin{cases} 0, & n = 0 \\ -G, & \text{otherwise} \end{cases}$$


This function is similar to the general case considered above;
however, it incurs an extra penalty for having any gap at all
between consecutive amino acids in a sequence, making the
penalty for continuing a gap once it has begun relatively smaller.
This is of interest in comparing proteins since the random pro-
cess of mutation is more likely to remove an entire group of
amino acids by breaking protein sequences than to remove the
same number of individual amino acids from different locations
on the same protein.


To implement such a function, one must modify the dy- 
namic program slightly more than in the previous example, but 
the main idea remains the same. For a scoring system such as 
this, one sets up a new set of variables which keep track of the 
best scores in all four of the possible combinations of situations 
where the subsequences A’ and B’ either end in gaps or do not. 
Actually, since any situation in which both subsequences end 
in gaps can be expressed as a gap in one of the sequences fol- 
lowed by a gap in the other, these values can be put into three 
matrices defined as follows: 


$$C00_{r,s} = \max\{C(A', B') \mid i_z = r,\ j_z = s\},$$


$$C10_{r,s} = \max\{C(A', B') + G + g \cdot (r - i_z) \mid i_z < r,\ j_z = s\},$$


$$C01_{r,s} = \max\{C(A', B') + G + g \cdot (s - j_z) \mid i_z = r,\ j_z < s\},$$


where the max is taken across all $A' = (a_{i_1}, \ldots, a_{i_z})$,
$B' = (b_{j_1}, \ldots, b_{j_z})$.
We then have the recursion relations:


$$C00_{r,s} = \max \begin{cases} 0 \\ C10_{r-1,s-1} + D(a_r, b_s) \\ C01_{r-1,s-1} + D(a_r, b_s) \\ C00_{r-1,s-1} + D(a_r, b_s) \end{cases}$$


$$C01_{r,s} = \max \begin{cases} C00_{r,s-1} + G + g \\ C10_{r,s-1} + G + g \\ C01_{r,s-1} + g \end{cases}$$


$$C10_{r,s} = \max \begin{cases} C00_{r-1,s} + G + g \\ C01_{r-1,s} + G + g \\ C10_{r-1,s} + g \end{cases}$$


which, as in the basic algorithm, allow us to compute the best
subsequence by computing $\max_{r,s} C00_{r,s}$. Thus, the dynamic pro-
gramming method allows us to do sequence comparisons with
a variety of scoring metrics, and to save any information about
those comparisons which could be useful.
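

The three recursion relations translate directly into a serial dynamic program. The sketch below is a plain row-by-row illustration, not the parallel CM-2 formulation; the function name is hypothetical, `gap_open` and `gap_extend` stand for the terms written G and g in the recursions and are assumed to be non-positive, and the boundary values (0 for C00, minus infinity for C10 and C01) are an assumption the text does not spell out.

```python
NEG_INF = float("-inf")


def affine_best_score(a, b, score, gap_open, gap_extend):
    """Best local score with a per-gap term G and a per-position term g."""
    rows, cols = len(a), len(b)
    c00 = [[0] * (cols + 1) for _ in range(rows + 1)]
    c01 = [[NEG_INF] * (cols + 1) for _ in range(rows + 1)]
    c10 = [[NEG_INF] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for r in range(1, rows + 1):
        for s in range(1, cols + 1):
            d = score(a[r - 1], b[s - 1])
            # Both subsequences end in a matched pair at (r, s).
            c00[r][s] = max(0,
                            c00[r - 1][s - 1] + d,
                            c01[r - 1][s - 1] + d,
                            c10[r - 1][s - 1] + d)
            # Gap along s: opening costs G + g, continuing costs g.
            c01[r][s] = max(c00[r][s - 1] + gap_open + gap_extend,
                            c10[r][s - 1] + gap_open + gap_extend,
                            c01[r][s - 1] + gap_extend)
            # Gap along r, with the same opening and continuation costs.
            c10[r][s] = max(c00[r - 1][s] + gap_open + gap_extend,
                            c01[r - 1][s] + gap_open + gap_extend,
                            c10[r - 1][s] + gap_extend)
            best = max(best, c00[r][s])
    return best
```

For example, with +3 for a match, -1 for a mismatch, G = -2 and g = -1, comparing "ABCD" with "ABXCD" gives 9: four matches (12) minus one gap of length one (3).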


7 Results 


Using the software described here we have begun to experiment
on databases of proteins using a variety of scoring metrics. The
inner loop of the basic algorithm takes about 300 μseconds on a
CM-2 with a clock speed of 7 MHz when no position information
is computed, and about 600 μseconds when the positions of the
best subsequences are calculated as in Section 6. We ran a
database of 2192 proteins against itself using a simple scoring
metric of the form $D(a_i, a_j) = (2\delta_{ij} - 1) + \alpha$, $g = \beta < 0$, where
$\alpha$, $\beta$ determine the relative penalties for mismatches and gaps.

This is one of the matrices most frequently used by biolo-
gists. The run took about 6 hours on an 8K CM-2 to compute
scores between all pairs of proteins and the positions of the sub-
sequences giving these scores. (This represents about $1.8 \times 10^{11}$
matrix entry computations, and would have taken approximately
45 minutes on a 64K machine.) We have begun to analyze the results
of this comparison search using various clustering heuristics to
find multiple-sequence similarities.


Another set of experiments which we carried out compared
a variety of scoring systems on several small databases. It can
be shown mathematically [15] that when one compares ran-
dom proteins with a scoring metric of the form $D(a_i, a_j) =
(2\delta_{ij} - 1) + \alpha$, $g = \beta$, the mean length of the subsequences in
the optimal matches varies according to the values of $\alpha$ and $\beta$
in a regular manner. The $\alpha\beta$ plane divides into two regions,
one where the mean length of optimal matches varies as the log
of the lengths of the proteins, and another where the optimal
matches vary linearly with the protein lengths. There is a re-
gion on the $\alpha\beta$ coordinate plane where the transition between
these two domains occurs, but an explicit determination of this
phase transition point has never been made. By computing the
mean

$$\frac{(i_z - i_1) + (j_z - j_1)}{2}$$

across all pairwise comparisons for varying values of $\alpha$, $\beta$, we
were able to locate this transition region empirically, and to de-
termine the qualitative behavior of this scoring function under
changes in the parameters $\alpha$ and $\beta$ [see Figure 4]. We repeated
this experiment for actual and random proteins, i.e., sequences
of amino acids generated according to the observed distribu-
tion of amino acids in the set of proteins in the database, and
obtained essentially the same results, indicating that the con-
clusions drawn here could be applied to comparisons of actual
databases.


Summary


The use of highly parallel computers has enabled us to make 
progress in several areas of biological research (Section 7), which 
would have been difficult without software tools to do high- 
speed subsequence comparisons using arbitrary scoring metrics. 
The methods used here could easily be modified to be used in 
other contexts such as speech recognition. 


Figure 4: A protein database consisting of fifty (50) proteins with lengths between 192 and 220
was compared with itself using a scoring matrix $D(a_i, a_j) = (2\delta_{ij} - 1) + \alpha$, $g = \beta$. The figure gives
the mean length of the best subsequence match for various $\alpha$, $\beta$ pairs. The phase transition region
is highlighted in gray.


ABSTRACT


Linear optimization problems are commonly solved via an iterative
technique known as the simplex method. This paper proposes and exam-
ines the performance of several parallel variants of the simplex and
revised simplex algorithms on the Intel iPSC, a message-based parallel
system. Linear optimization test data are drawn from commercial sources 
and represent realistic problems. Analysis shows that the speedup 
obtained is sensitive to both the structure of the underlying data and the 
data partitioning. 


1. INTRODUCTION


Linear optimization problems are underconstrained linear systems 
that maximize or minimize some objective function. These linear 
optimization problems are natural formulations of many business plans 
and often contain hundreds of equations with thousands of variables. 
Changing economic conditions dictate that many organizations solve 
these large linear optimization problems daily. 


Historically, linear optimization problems have been solved via the
simplex method [Luen73]. Although it is well-known that the
computational complexity of the simplex method is not polynomial in the
number of equations, experience has shown that its average case behavior
is linear. Despite the excellent performance of the simplex method, the
size of the optimization problems and the frequency of their solution
make linear optimization a computationally taxing endeavor. This
computational complexity, coupled with the wealth of research on the
simplex method, makes parallel solution of linear optimization problems
an attractive research problem. Implementations of simplex utilizing
special purpose VLSI arrays have been proposed [BeBo87, OnNa84].


This paper examines the performance of several parallel variants of
the simplex and revised simplex algorithms on a message-passing system.
A review of the linear optimization problem including a discussion of the 
simplex and revised simplex algorithms is presented in §2. In §3, 
possible parallelizations of the simplex algorithm are discussed, together 
with their potential advantages and disadvantages, and results of 
benchmark studies of the alternatives are shown. In §4, this discussion is 
continued for the revised simplex algorithm, and the simplex and revised 
simplex algorithm performance is compared using linear optimization 
data drawn from commercial sources. This performance analysis shows 
that the speedup obtained is sensitive to both the structure of the 
underlying data and the data partitioning. A summary of the work is 
presented in §5. 


2. LINEAR OPTIMIZATION AND THE SEQUENTIAL SIMPLEX 
ALGORITHM 


2.1. General Linear Optimization Problem 
Mathematically, the linear optimization problem can be stated as: 


Minimize:    $c^T x$                                          (1)


Subject to:  $Ax = b$, where $b \ge 0$
             $x \ge 0$


Here, $c^T$ is an n vector of variable coefficients that defines the objective
function (i.e., the function being minimized). For a maximizing problem,
the negative of the objective function can be minimized. The objective
function can thus be viewed as a cost function, where the object is to
minimize total cost. The m × n linear system $Ax = b$ defines the linear
constraints on the variables x of the objective function. Each of the m rows
of the matrix A defines a constraint on the n variables of the objective
function.


The optimization problem arises because the linear system Ax = b 
is underconstrained (i.e., m is smaller than n, and the matrix A contains 
many more columns (variables) than rows (constraints)). Consequently, 
there are many possible x vectors that satisfy the system Ax =b. A 
fundamental theorem of linear programming states that an optimal 
solution, if it exists, occurs when n —m elements of x are zero (i.e., 
when there are precisely m non-zero elements of x). This corresponds to 
the solution of an m × m linear system, the basis, obtained by selecting
m of the n columns of the matrix A. The simplex method is a search
algorithm that decreases the value of the objective function at each 
iteration by selecting a non-zero element of x, a so-called basic variable, 
and replacing the corresponding column of A with another column. 


In this paper, two forms of the simplex method are examined: the 
simplex algorithm and the revised simplex algorithm. Although these 
two algorithms are based on the same underlying principles, they use 
different types of operations to reach the solution. In the remainder of 
this section a brief, and by no means complete, overview of the 
algorithms is presented. For more thorough treatments of the simplex 
method and optimization theory, see [Foul81, Murt83, Llew64, Luen73]. 
See [KuTZ71, Solo84] for helpful hints on practical implementation of 
the simplex and revised simplex algorithms. 


2.2. Simplex Algorithm 


In most practical problems some or all of the equations in $Ax = b$
are inequalities. There is a simple method of transforming these
inequalities into equalities while maintaining the $x \ge 0$ constraint. In
addition, one must obtain an initial feasible solution (an initial solution
that satisfies the constraint $x \ge 0$). Thus, each equation forming $Ax = b$
is transformed as follows:


If $\sum_{j=1}^{n} a_{ij} x_j \le b_i$, a slack variable is added:
$\sum_{j=1}^{n} a_{ij} x_j + s_i = b_i$


If $\sum_{j=1}^{n} a_{ij} x_j \ge b_i$, a surplus variable and an artificial variable are
added: $\sum_{j=1}^{n} a_{ij} x_j - s_i + r_i = b_i$


If $\sum_{j=1}^{n} a_{ij} x_j = b_i$, an artificial variable is added:
$\sum_{j=1}^{n} a_{ij} x_j + r_i = b_i$


The artificial variables must be zero when the optimum value is found.
They aid in obtaining an initial feasible solution. The slack and surplus
variables may or may not be zero when the optimum is found. Clearly,
the $x \ge 0$ constraint can still be satisfied by these new equations.
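

The three transformation rules can be applied mechanically to build the equality-form constraint matrix. The Python sketch below is a minimal illustration under assumed conventions: constraints are given as dense rows with a sense string ("<=", ">=" or "="), the right-hand sides are already non-negative, and the function name and return layout are hypothetical.

```python
def to_standard_form(rows, senses, rhs):
    """Append slack, surplus and artificial columns to dense constraints.

    Returns (A, b, kinds), where kinds labels every column of A as
    "x", "slack", "surplus" or "artificial".
    """
    n = len(rows[0])
    extra = []  # one (row index, coefficient, kind) entry per added column
    for i, sense in enumerate(senses):
        if sense == "<=":
            extra.append((i, 1.0, "slack"))
        elif sense == ">=":
            extra.append((i, -1.0, "surplus"))
            extra.append((i, 1.0, "artificial"))
        elif sense == "=":
            extra.append((i, 1.0, "artificial"))
        else:
            raise ValueError("unknown constraint sense: %r" % (sense,))
    A = []
    for i, row in enumerate(rows):
        added = [coef if r == i else 0.0 for r, coef, _ in extra]
        A.append([float(v) for v in row] + added)
    kinds = ["x"] * n + [kind for _, _, kind in extra]
    return A, [float(v) for v in rhs], kinds
```

Each row receives exactly one added column with coefficient +1 (its slack or artificial variable), so those columns together form the identity matrix used as the starting basis.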


To explain the notion of a feasible solution, the notion of a basis
must be reviewed. A basis of A is a linearly independent collection of m
columns of A. It can be represented as $B = [A_{j_1}, \ldots, A_{j_m}]$. With a valid
basis of A, a basic solution can be found by setting all components of x
corresponding to columns of A not in B to zero and solving the resulting
m equations to determine the remaining components of x. These are the
basic variables. If the solution to these equations satisfies $x \ge 0$, then it
is a basic feasible solution. The initial basis is commonly chosen as the
A columns corresponding to the slack and artificial variables, because
this basis always results in a basic feasible solution.


The simplex method systematically moves from basic feasible 
solution (bfs) to bfs. This movement, called pivoting, will terminate in 
an optimal solution if one exists. This pivoting consists of three steps: 
(1) finding a new basis column that decreases the objective (cost) 
function value, (2) finding the column to remove from the basis that 
maximizes the decrease in the objective function value while maintaining 
the requirements of a bfs, and (3) replacing the old basis column with the 
new one. 


If B is an m × m matrix representing the current basis, and $A_{B(i)}$ is
the column of A that is currently the i-th basis vector of B, any column
$A_j$ in the original A matrix can be written in terms of the basis vectors of B:


$$A_j = \sum_{i=1}^{m} \bar{a}_{ij} A_{B(i)} \qquad (2)$$


Let $x_{B(i)}$ represent the i-th value of the current bfs, and let $c_{B(i)}$ represent
the element of the objective row corresponding to this variable $x_{B(i)}$.
Equation (2) can be interpreted as meaning that for every unit of the
variable $x_j$ that enters the basis, an amount $\bar{a}_{ij}$ of each of the
variables $x_{B(i)}$ must leave. Thus a unit increase in the variable $x_j$ results
in a net change in the objective function equal to:


$$\bar{c}_j = c_j - \sum_{i=1}^{m} \bar{a}_{ij} c_{B(i)} \qquad (3)$$


It is then profitable to make $x_j$ basic and to bring $A_j$ into the basis when
$\bar{c}_j < 0$. Also, if all $\bar{c}_j \ge 0$, then a local (and global) optimum has been
reached, and simplex terminates.


$\bar{A}_j$ is the representation of $A_j$ with respect to B, and can be
obtained as $\bar{A}_j = B^{-1} A_j$. Also, because b is simply another column, it
can also be represented in terms of the basis as $\bar{b} = B^{-1} b$. The value of
$\bar{c}_j$ can now be calculated in terms of B:


$$\bar{c}_j = c_j - \sum_{i=1}^{m} \bar{a}_{ij} c_{B(i)} = c_j - c_B^T \bar{A}_j = c_j - c_B^T B^{-1} A_j$$


or $\bar{c}_j = c_j - \pi^T A_j$ where $\pi^T = c_B^T B^{-1}$.

Any $x_j$ with $\bar{c}_j < 0$ can become basic and decrease the objective row
function, but generally the $x_j$ corresponding to the minimum $\bar{c}_j$ is chosen.
Although there is no theory that guarantees this to be the best choice for
decreasing the objective row function, empirical data show that it
suffices. The corresponding A column entering the basis is also termed
the pivot column.


Now, a column must be chosen to leave the basis. The column
must be chosen such that replacing it with the entering basis column will
guarantee that a bfs still exists with the new basis. Although the details
will be omitted, the following criterion guarantees that a new bfs will be
reached (for entering basis column = $\bar{A}_{piv\_col}$):


find the row i which minimizes $\bar{b}_i / \bar{a}_{i,piv\_col}$
for all $\bar{a}_{i,piv\_col} > 0$


Intuitively, the purpose of this criterion is to ensure that the new solution
$\bar{b}$ of the bfs is non-negative. This row i (also known as the pivot row)
indicates that column $A_{B(i)}$ of the basis representation should be replaced
by $A_{piv\_col}$. Correspondingly, $x_{piv\_col}$ replaces $x_{B(i)}$ as a basic variable.
This completes one iteration of the simplex algorithm.


It can be shown [Foul81] that simple matrix row operations via
Gauss-Jordan elimination produce $\bar{A}$, $\bar{c}$, and $\bar{b}$ without altering the set of
feasible solutions. The simplex algorithm uses this Gauss-Jordan
transformation of the tableau to move from bfs to bfs. Figure 1 shows
the initial simplex tableau, with the starting values for $\bar{A}$, $\bar{b}$, and $\bar{c}$
simply equal to A, b, and c. $x_0$ is the current optimum value of the
objective function. Figure 2 enumerates the computational steps for each
iteration. The third step uses Gauss-Jordan elimination to manipulate the
entire coefficient matrix in a pivot operation. This places a 1 at the pivot
point (the intersection of the pivot column and row) and zeros elsewhere
in the pivot column, including the objective row, and updates $\bar{A}$, $\bar{b}$, and
$\bar{c}$.


Analysis of Figure 2 shows that the complexity of the simplex
algorithm is $O(mn - m^2 + n + m)$ additions and comparisons,
$O(mn - m^2 + m)$ multiplications, and $O(m + 1)$ divisions per iteration.


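    $\bar{a}_{11}$  $\bar{a}_{12}$  ...  $\bar{a}_{1n}$  |  $\bar{b}_1$
      .       .            .     |   .
    $\bar{a}_{m1}$  $\bar{a}_{m2}$  ...  $\bar{a}_{mn}$  |  $\bar{b}_m$
    $\bar{c}_1$   $\bar{c}_2$   ...  $\bar{c}_n$   |  $x_0$


Figure 1. Simplex Tableau


Locate piv_col    min_obj = minimum($\bar{c}_j$), j = 1,...,n
                  if min_obj ≥ 0 then exit, optimum has been found!
                  piv_col = j for the column with this minimum $\bar{c}_j$

Locate piv_row    min_ratio = min over i = 1,...,m of $\bar{b}_i / \bar{a}_{i,piv\_col}$
                      for $\bar{a}_{i,piv\_col} > 0$
                  piv_row = i for the row with this minimum ratio

Gauss-Jordan      for i ≠ piv_row
Elimination           factor$_i$ = $\bar{a}_{i,piv\_col} / \bar{a}_{piv\_row,piv\_col}$
                      $\bar{a}_{ij} = \bar{a}_{ij} - \mathrm{factor}_i\, \bar{a}_{piv\_row,j}$,  j = 1,...,n
                      $\bar{b}_i = \bar{b}_i - \mathrm{factor}_i\, \bar{b}_{piv\_row}$
                  end for
                  $\bar{c}_j = \bar{c}_j - (\bar{c}_{piv\_col} / \bar{a}_{piv\_row,piv\_col})\, \bar{a}_{piv\_row,j}$,  j = 1,...,n
                  $\bar{a}_{piv\_row,j} = \bar{a}_{piv\_row,j} / \bar{a}_{piv\_row,piv\_col}$,  j = 1,...,n
                  $\bar{b}_{piv\_row} = \bar{b}_{piv\_row} / \bar{a}_{piv\_row,piv\_col}$


Figure 2. Simplex Algorithm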
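

The three phases of Figure 2 can be sketched as a short dense-tableau routine. This Python version is illustrative only and rests on several assumptions not stated in the text: the problem is already in equality form, the supplied starting basis is an identity submatrix of A whose columns have zero cost (for example, all slack columns), ties are broken by lowest index, and no anti-cycling rule is applied. The function name and the tolerance parameter are hypothetical.

```python
def simplex_tableau(A, b, c, basis, tol=1e-9):
    """Minimize c.x subject to A x = b, x >= 0, from a slack-type basis.

    Returns (x, objective value).
    """
    A = [list(map(float, row)) for row in A]
    b = list(map(float, b))
    cbar = list(map(float, c))
    basis = list(basis)
    m, n = len(A), len(cbar)
    while True:
        # Locate piv_col: the most negative reduced cost.
        piv_col = min(range(n), key=lambda j: cbar[j])
        if cbar[piv_col] >= -tol:
            break  # all reduced costs non-negative: optimum found
        # Locate piv_row: minimum ratio over positive pivot-column entries.
        rows = [i for i in range(m) if A[i][piv_col] > tol]
        if not rows:
            raise ValueError("problem is unbounded")
        piv_row = min(rows, key=lambda i: b[i] / A[i][piv_col])
        pivot = A[piv_row][piv_col]
        # Gauss-Jordan elimination on every other row, then the objective row.
        for i in range(m):
            if i != piv_row:
                factor = A[i][piv_col] / pivot
                A[i] = [A[i][j] - factor * A[piv_row][j] for j in range(n)]
                b[i] -= factor * b[piv_row]
        cfactor = cbar[piv_col] / pivot
        cbar = [cbar[j] - cfactor * A[piv_row][j] for j in range(n)]
        # Normalize the pivot row last, as in Figure 2.
        A[piv_row] = [v / pivot for v in A[piv_row]]
        b[piv_row] /= pivot
        basis[piv_row] = piv_col
    x = [0.0] * n
    for i, j in enumerate(basis):
        x[j] = b[i]
    return x, sum(cj * xj for cj, xj in zip(c, x))
```

As a usage example, minimizing $-3x_1 - 5x_2$ subject to $x_1 \le 4$, $2x_2 \le 12$, $3x_1 + 2x_2 \le 18$ (three slack columns appended, starting basis equal to those columns) takes two pivots and ends at $x_1 = 2$, $x_2 = 6$.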


2.3. Revised Simplex Algorithm 


Unlike the simplex algorithm, which continually updates all of the
$\bar{A}$ columns, the revised simplex algorithm [Solo84] maintains only
enough information to reproduce an updated objective row or any
updated column of $\bar{A}$, and relies on $B^{-1}$ and $\pi$ as in the explanation of the
general simplex method. The $\bar{b}$ column is handled identically to the
simplex algorithm. The original A matrix and objective row are never
changed, and since A is usually quite sparse, it can be stored in sparse
form. The B matrix is never needed in computations, but its inverse $B^{-1}$
is used for calculating $\pi$ and $\bar{A}$ values. Although the proof is beyond the
scope of this introduction, $B^{-1}$ and $-\pi$ can be maintained through
elementary row operations.


The search for the pivot column (the "find piv_col" step) in a
revised simplex iteration uses $-\pi$ to calculate the updated $\bar{c}^T$ row as
shown in Figure 3, and the search for the pivot row (the "find piv_row"
step) uses $B^{-1}$ to calculate the updated $\bar{A}_{piv\_col}$. Matrix $B^{-1}$ is initially an
m × m identity matrix, and $-\pi$ is initially a 0 vector. Once the pivot
column and pivot row have been identified, the $B^{-1}$ matrix and $-\pi$ vector
can be updated through a process similar to Gauss-Jordan elimination,
using $\bar{A}_{piv\_col}$ to determine the row fractions as shown in Figure 3.


In the revised simplex, more time is spent in finding the minimum
objective and ratio values, but less time is required for updating the
matrix. If s is the fraction of non-zero values
in the sparse A matrix, then the number of addition and compare
operations per iteration is $O(mns + m^2 + n + m)$. Similarly, the number of
multiplications performed per iteration is $O(mns + m^2 + m)$. The number
of divisions is $O(m + 1)$. Overall, it can be seen that the algorithm requires
$O(2mns + 2m^2)$ floating point operations per iteration, compared to
$O(2mn - 2m^2)$ for standard simplex. For a sparse A tableau and
$m < n - m$, revised simplex requires fewer floating point operations.
Another advantage of revised simplex is the algorithm's computational
robustness. Revised simplex does not update the entering variable
column until it actually enters the basis. Because of this, the
accumulation of round-off errors on the coefficients of the column is
reduced.


Find piv_col      $\bar{c}_j = c_j - \pi^T A_j$,  j = 1,...,n
                  min_obj = minimum($\bar{c}_j$), j = 1,...,n
                  if min_obj ≥ 0 then exit, optimum has been found!
                  piv_col = j for column with this minimum $\bar{c}_j$

Find piv_row      $\bar{A}_{piv\_col} = B^{-1} A_{piv\_col}$
                  min_ratio = min over i = 1,...,m of $\bar{b}_i / \bar{a}_{i,piv\_col}$
                      for $\bar{a}_{i,piv\_col} > 0$
                  piv_row = i for row with this minimum ratio

Gauss-Jordan      for i ≠ piv_row
Elimination           factor$_i$ = $\bar{a}_{i,piv\_col} / \bar{a}_{piv\_row,piv\_col}$
                      $B^{-1}_{ij} = B^{-1}_{ij} - \mathrm{factor}_i\, B^{-1}_{piv\_row,j}$,  j = 1,...,m
                      $\bar{b}_i = \bar{b}_i - \mathrm{factor}_i\, \bar{b}_{piv\_row}$
                  end for
                  $-\pi_j = -\pi_j - (\bar{c}_{piv\_col} / \bar{a}_{piv\_row,piv\_col})\, B^{-1}_{piv\_row,j}$,  j = 1,...,m
                  $B^{-1}_{piv\_row,j} = B^{-1}_{piv\_row,j} / \bar{a}_{piv\_row,piv\_col}$,  j = 1,...,m
                  $\bar{b}_{piv\_row} = \bar{b}_{piv\_row} / \bar{a}_{piv\_row,piv\_col}$


Figure 3. Revised Simplex Algorithm
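

The revised simplex iteration of Figure 3 can be sketched in the same style. The Python code below is a dense, illustrative version under stated assumptions: the starting basis is an identity submatrix of A with zero cost, so that $B^{-1}$ starts as the identity and $\pi$ as the zero vector; it stores $\pi$ rather than $-\pi$; A is kept dense although the text stores it in sparse form; and no anti-cycling rule is used. The function name and tolerance are hypothetical.

```python
def revised_simplex(A, b, c, basis, tol=1e-9):
    """Minimize c.x subject to A x = b, x >= 0, keeping only B^-1, pi, b-bar.

    The original A and c are never modified. Returns (x, objective value).
    """
    m, n = len(A), len(c)
    binv = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    pi = [0.0] * m
    bbar = [float(v) for v in b]
    basis = list(basis)
    while True:
        # Find piv_col: price out every column of the untouched A.
        cbar = [c[j] - sum(pi[i] * A[i][j] for i in range(m))
                for j in range(n)]
        piv_col = min(range(n), key=lambda j: cbar[j])
        if cbar[piv_col] >= -tol:
            break
        # Find piv_row: only the entering column is brought up to date.
        col = [sum(binv[i][k] * A[k][piv_col] for k in range(m))
               for i in range(m)]
        rows = [i for i in range(m) if col[i] > tol]
        if not rows:
            raise ValueError("problem is unbounded")
        piv_row = min(rows, key=lambda i: bbar[i] / col[i])
        pivot = col[piv_row]
        # Gauss-Jordan style update of B^-1 and b-bar, then pi.
        for i in range(m):
            if i != piv_row:
                factor = col[i] / pivot
                binv[i] = [binv[i][j] - factor * binv[piv_row][j]
                           for j in range(m)]
                bbar[i] -= factor * bbar[piv_row]
        cfactor = cbar[piv_col] / pivot
        pi = [pi[j] + cfactor * binv[piv_row][j] for j in range(m)]
        binv[piv_row] = [v / pivot for v in binv[piv_row]]
        bbar[piv_row] /= pivot
        basis[piv_row] = piv_col
    x = [0.0] * n
    for i, j in enumerate(basis):
        x[j] = bbar[i]
    return x, sum(cj * xj for cj, xj in zip(c, x))
```

The per-iteration work is visible in the code: the pricing loop touches every entry of A (the $mns$ term when A is sparse), while the update touches only the m × m inverse (the $m^2$ term).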


3. PARALLEL SIMPLEX ALGORITHMS 


The message passing algorithms of this paper were developed on 
an Intel iPSC hypercube with 16 processors, each with an extended 
memory of 4.5 Megabytes. A 32 node Intel hypercube with 0.5 
Megabytes of memory per node was also used for experiments measuring 
communication times. The extended memory on the 16 node machine 
supported larger problem sizes, and made it possible to measure single 
node performance for calculating speedups and efficiencies on large 
problems. 


In this implementation, there is little difference between a serial 
version of the simplex or revised simplex algorithm and the parallel 
version running on only 1 node, except for down-loading and up-loading 
of data from and to the host at the start and the finish of the program. 
This facilitated the measurements of serial execution time. Serial 
execution time was defined as the time between when the last data is 
down-loaded from the host to the time when up-loading of the answer to 
the host begins (i.e., all host-node interactions were excluded). Parallel 
execution time for 1 or more nodes included the time for down-loading 
and up-loading data. Hence, calculated speedups for a single node are
less than 1.


The linear optimization test data are drawn from commercial
sources and represent realistic problems. We benchmarked several
problems with a range of sizes from "afiro" (m=27 and n=59) to
"bandm" (m=305 and n=777) on 1, 2, 4, 8, and 16 nodes of the iPSC.
Larger problems were also run on 16 nodes, but did not fit into the 
memory of a single hypercube node, hindering speedup calculations. 


In message passing architectures, many implementation details are 
determined once the data distribution decisions are made. This is 
primarily a consequence of the relatively high cost of communication 
versus local memory access. The simplex and revised simplex 
algorithms share many characteristics with linear equation solution, 
matrix multiplication, and other matrix operations. Previous work on 
distributed linear system solvers has advocated row or column 
partitioning of matrices [GeHe85, Mole85, AyOz87]. Similar methods 
are pursued here, and two data distribution strategies are presented for 
solving the simplex algorithm efficiently on hypercubes. 


3.1. Column Partitioned Simplex 


In the column partitioned method (method A) complete columns 
(including the objective value) are divided equally among the processors. 
The pivot column determination requires two steps: (1) finding the local 
minimum of the objective values for those columns at each node, and (2) 
using a global minimum communication process to identify and distribute 
piv_col, the identity of the column containing the minimum objective 
value. Implementation of the global minimum communication is 
discussed in §3.3. 


Because one processor contains the entire pivot column $\bar{A}_{piv\_col}$,
that processor can now determine the pivot row if it also possesses the
column vector $\bar{b}$. Hence, one must balance distributing and maintaining
this $\bar{b}$ column at each node against adding a communication step to pass
the $\bar{b}$ column from its resident processor to the processor containing
$\bar{A}_{piv\_col}$. Once the pivot row is determined, $\bar{A}_{piv\_col}$ is passed to every
processor node along with the value of piv_row. Each node then uses
$\bar{A}_{piv\_col}$ to perform the Gauss-Jordan elimination, exchanging the basic
columns, and the simplex iteration is complete.


Another consideration for column partitioning is the assignment of 
processors to tableau columns. Both the column block and the column 
wrap methods [GeHe85] were considered; however, in the standard
tableau format, basic columns were initially concentrated on the right side
of the A matrix. Basic columns require less work to maintain because 
they need not participate in the Gauss-Jordan elimination. To equally 
divide these basic columns among the processors, the column wrapping 
method of partitioning was chosen. 
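

The difference between the two assignments is easy to see on a small case. The Python sketch below is a hypothetical illustration (the function names and the 8-column example are not from the paper): it builds the block and wrap assignments and counts how many basic columns each node receives when the basic columns sit on the right side of the matrix.

```python
def block_columns(n, nodes):
    """Column block: each node receives one contiguous run of columns."""
    base, extra = divmod(n, nodes)
    out, start = [], 0
    for k in range(nodes):
        size = base + (1 if k < extra else 0)
        out.append(list(range(start, start + size)))
        start += size
    return out


def wrap_columns(n, nodes):
    """Column wrap: column j is assigned to node j mod nodes."""
    return [list(range(k, n, nodes)) for k in range(nodes)]


def basic_per_node(assignment, basic):
    """Number of basic columns held by each node."""
    return [sum(1 for j in cols if j in basic) for cols in assignment]


# Eight columns on four nodes, with the last four columns basic.
basic = {4, 5, 6, 7}
block_load = basic_per_node(block_columns(8, 4), basic)
wrap_load = basic_per_node(wrap_columns(8, 4), basic)
```

In this example `block_load` is `[0, 0, 2, 2]` while `wrap_load` is `[1, 1, 1, 1]`: wrapping spreads the cheap basic columns evenly, so every node has the same amount of elimination work.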


The potential performance advantage of distributing $\bar{b}$ to every
node was investigated. Benchmarks showed that, for the Intel hypercube
with dimensions 0 to 5, keeping the $\bar{b}$ column on only one node and
sending it to the node possessing $\bar{A}_{piv\_col}$ was more efficient than
distributing the $\bar{b}$ column. Figure 4(a) summarizes the column partition
data placement. Figure 5 shows the speedups obtained both for a
distributed $\bar{b}$ column and for $\bar{b}$ maintained at only one node. This figure
illustrates the effect of problem size on speedup. For a given hypercube
dimension, the time required for the parallel Gauss-Jordan elimination
step increases with both m and n, whereas the overhead for global
communication remains constant. Also, the time required for the serial
computation of piv_row and the global send of $\bar{A}_{piv\_col}$ increases with m.
Thus, as the product of m and n increases, the parallel steps consume a
larger portion of the total execution time, so speedup and efficiency
increase. For the problems shown, keeping the $\bar{b}$ column on one node
provides marginally greater speedups. For the "scsd1" problem, the
difference in speedup is quite small. This is because updating the $\bar{b}$
column is inexpensive (small m) in comparison to the total cost of
updating $\bar{A}$ (large n, resulting in many columns per node) through
Gauss-Jordan elimination.


3.2. Row Partitioned Simplex 


In the row partition strategy, complete rows of the tableau
(including the value from the $\bar{b}$ column) are divided equally among the
processors. There are two different options for distributing the $c^T$
(objective) row: equally divide the elements of the $c^T$ row among the
processors, or give each node the entire $c^T$ row.


Figure 4. Data placement for simplex column and row partitioning.
(for m = 4, n = 7, and 4 processors)


Figure 5. Speedups for the simplex column method with
two different b column distributions


If each node possesses the entire objective row, then the search for
piv_col can be done separately by each processor and requires no
communication. If each node contains only a portion of the objective
row, then a global search is required to find piv_col, but each processor
can find its locally minimum piv_col in parallel with the other
processors. These options again illustrate the communication versus
computation trade-off. The benchmark results show that it is more
efficient to partition the $c^T$ row and proceed with the global
minimization, even for the smaller problem sizes that were tested. For
the larger problems the difference was large (e.g., in one case speedup
increased from 8 to 11 when the $c^T$ row was partitioned).


After piv_col is determined, piv_row is calculated. Each node 
possesses a portion of the A_piv_col column along with the b column 
elements of the same rows. Thus each node has enough information to 
find the minimum ratio b_i / a_{i,piv_col} over all of its rows with 
a_{i,piv_col} > 0. A global minimum of each locally minimum piv_row is 
needed to find the global piv_row, and piv_row is distributed to every 
processor. 
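
The row-partitioned pivot row search described above can be sketched as follows. This is only an illustration, not the authors' iPSC code: the function names and the sample data are hypothetical, and the global minimization is replaced by an ordinary `min` over the per-node results.

```python
from math import inf

def local_min_ratio(rows):
    """Minimum-ratio test on the rows held by one node.

    rows: list of (global_row_index, a_i_piv_col, b_i) tuples.
    Returns (ratio, row_index); (inf, -1) if no entry is positive.
    """
    best = (inf, -1)
    for idx, a, b in rows:
        if a > 0:                      # only positive pivot-column entries qualify
            best = min(best, (b / a, idx))
    return best

def global_piv_row(partitions):
    """Stand-in for the global minimization over all nodes."""
    return min(local_min_ratio(rows) for rows in partitions)

# Hypothetical data: 4 tableau rows spread over 2 nodes.
partitions = [
    [(0, 2.0, 8.0), (1, -1.0, 3.0)],   # node 0
    [(2, 4.0, 4.0), (3, 0.0, 5.0)],    # node 1
]
ratio, piv_row = global_piv_row(partitions)
```

Each node needs only its own rows of A_piv_col and b, so the local search involves no communication; only the final combination does.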


For the Gauss-Jordan elimination step, the entire row 
corresponding to piv_row must be resident on every processor. 
Therefore, the processor holding this row sends it to every other node, 
and the elimination step executes. Figure 4(b) shows the row partition 
data placement, and Table 1 compares differences in the simplex 
algorithm variants, along with the revised simplex algorithm variants to 
be discussed later. 


3.3. Methods for Global Minimum 


Each simplex algorithm partitioning requires at least one 
computation of a global minimum during each iteration. For all cases, 
the result of this global minimum is needed at every node. Two different 
methods were compared for obtaining the global minimum. These two 
methods are illustrated in Figure 6. Method EXCHANGE pairs nodes for 
an exchange of minima during a step. On each step, nodes whose 
addresses differ only in the bit for a particular power of 2 (i.e., a 
particular dimension) are paired. A total of log N of these exchange 
steps are required, with a different dimension used to pair processors 
for each step. Method CONDENSE 
starts by passing all local minima from the upper half of the hypercube to 
the lower half. Then the lower half splits and passes its newly calculated 
minima from its own top half to its bottom half again. This process 
continues until one node (node 0 in this implementation) contains the 
global minimum. This phase of CONDENSE can be viewed as an 
inverse global send. Node 0 then globally sends this minimum to every 
other node. In total 2 log N steps are required, but at each step a processor is 
either sending or receiving, but not both as in EXCHANGE. In addition, 
no intermediate computation is needed during the log N steps required to 
globally send the minimum. 
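
To make the two communication patterns concrete, here is a small sequential simulation. It is a sketch under simplifying assumptions: all nodes advance in lock step, so it ignores message timing, waiting, and out-of-order arrival, and the function names are hypothetical.

```python
def exchange_min(local):
    """EXCHANGE: in step k, node i swaps minima with node i XOR 2**k."""
    n = len(local)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    vals = list(local)
    steps = 0
    bit = 1
    while bit < n:
        # Every node keeps the smaller of its own value and its partner's.
        vals = [min(vals[i], vals[i ^ bit]) for i in range(n)]
        bit <<= 1
        steps += 1
    return vals, steps

def condense_min(local):
    """CONDENSE: fold upper halves onto lower halves, then send from node 0."""
    n = len(local)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    vals = list(local)
    steps = 0
    half = n // 2
    while half >= 1:                   # phase 1: inverse global send
        for i in range(half):
            vals[i] = min(vals[i], vals[i + half])
        half //= 2
        steps += 1
    half = 1
    while half < n:                    # phase 2: global send from node 0
        for i in range(half):
            vals[i + half] = vals[i]
        half *= 2
        steps += 1
    return vals, steps

minima = [7, 3, 9, 1, 5, 8, 2, 6]      # hypothetical local minima, 8 nodes
ex_vals, ex_steps = exchange_min(minima)
co_vals, co_steps = condense_min(minima)
```

For 8 nodes the simulation uses log N = 3 steps for EXCHANGE and 2 log N = 6 steps for CONDENSE, matching the step counts given in the text.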


Table 2 shows benchmark results comparing these 
two methods. The numbers are normalized to the CONDENSE 
communication times to show relative communication costs; these costs 
include waiting time. The results show that CONDENSE is slightly less 
efficient than EXCHANGE for a 2-node hypercube but becomes 
increasingly more efficient as the dimension of the hypercube increases. 


Simplex algorithm     Major step within the algorithm
partitioning          piv_row                    Gauss-Jordan elim.
---------------------------------------------------------------------------
Column (A)            Sequential computation     A_piv_col global send
                                                 Parallel computation

Row (B)               Parallel computation       A row global send
                      Global minimization        Parallel computation

Revised (C),          Parallel computation       B^{-1} row global send
entire A on node      Global minimization        Parallel computation

Revised (D),          A_piv_col global send      B^{-1} row global send
partitioned A         Parallel computation       Parallel computation
                      Global minimization


Table 1. Summary of Hypercube algorithm differences 
(finding piv_col is similar for all partitionings) 


[Figure 6 panels: CONDENSE Steps 1-3 (global minimum at node 0); 
CONDENSE Steps 4-6 (Step 4 starts the global send; global minimum at 
all nodes after Step 6); EXCHANGE Steps 1-3 (global minimum at all 
nodes)] 

Figure 6. CONDENSE and EXCHANGE methods 
for finding a global minimum. 


Because the Intel iPSC hypercube does not support simultaneous 
bidirectional transmission, the EXCHANGE method actually requires 
2 log N node-to-node message delays and 2 log N comparisons of 
minima at each node. The CONDENSE method requires 2 log N 
message delays and only log N comparisons of minima. The main 
reason for CONDENSE’s superiority is that messages received out of 
order will not destroy the global minimum calculation (e.g., if node 0 
receives the current minimum from node 2 before it receives the 
minimum from node 4 in Figure 6). This is because each node does not 
send any information until it first receives all of its required minimum 
information. The EXCHANGE method has no similar property, and 
messages arriving out of order must be saved until the proper time, so 
more synchronization overhead is inherent. The chance of out-of-order 
messages rises with increasing hypercube dimension, favoring 
CONDENSE for larger hypercubes. There is a further advantage to the 
CONDENSE method: most communication steps require only a subset 
of the processors, so any computation occurring concurrently will 
proceed with fewer interruptions. 


In light of these results, the CONDENSE method was chosen for 
all global minimum calculations. On a hypercube implemented with two 
physical links between nodes, the EXCHANGE method should be 
superior, since the number of steps will reduce to log N. 


3.4. Performance Comparisons between simplex methods A and B 


The simplex column partition method A and the row partition 
method B use an identical method for finding piv_col. Finding piv_row, 
however, is quite different. The column method is a serial computation 
with no communication. The row method is a parallel computation but 
requires a global minimization step. This provides another computation 
versus communication trade-off. 


[Table 2 columns: hypercube dimension; CONDENSE and EXCHANGE delays 
for the "share1b" problem and the "israel" problem] 

Table 2. Normalized global minimization communication delays 
(normalized to the CONDENSE delays) 


The Gauss-Jordan step requires a column global send for the 
column partition and a row global send for the row partition. In most 
cases, m < n, and the column global send is more efficient. However, 
the column partition requires each node to calculate a row multiplication 
factor for all rows during the Gauss-Jordan step. In the row partition, 
each node only calculates these factors for the rows it possesses. 


Figure 7 compares the column method A to the row method B for 
various small to medium sized linear optimization problems. For most 
problems that were tested, row partitioning achieves a higher speedup for 
2 to 16 processors. The extra serial computation required for the column 
method is more costly than the extra global minimum required for the 
row method. Also, as the problem size increases, the cost of the serial 
portion of the column method increases faster than the costs for finding 
piv_row in row method B. One interesting exception is "scsd1," which 
has a small number of rows m with a number of columns n >> m. For 
this small number of rows, the serial piv_row phase of method A is not 
expensive, but for method B the communication of the long pivot row 
required for Gauss-Jordan elimination is expensive, and method A has much 
better speedup. 


Figure 7. Speedups for simplex column and row partitioning methods 


4. REVISED SIMPLEX PARTITIONING METHODS 


The revised simplex algorithm usually offers a reduced amount of 
computation, but increases the complexity of interaction between various 
elements of the data structure, and thus potentially incurs more 
communication overhead. The A matrix is static and sparse (typically 
between 5 and 10% [Solo84] of the elements are non-zero), and all 
operations involving this matrix use an entire A column (vector-vector or 
matrix-vector multiplication). Rows of A carry no significance in revised 
simplex. Thus, the viable options are either to partition columns equally 
among the nodes or to place the entire matrix in each processor’s 
memory. Since the A matrix is not manipulated, there is no penalty for 
keeping the entire matrix at every node unless there is not enough 
memory space. For now, it is assumed that the entire A matrix and 
objective row are stored at each node, and A is stored in a sparse data 
structure. 


It was found in the simplex algorithm that the piv_col calculation 
should be performed through a global minimization, and this is even 
more valid for revised simplex since the piv_col search must be preceded 
by the computationally intensive update of the objective row 
(c_j = c_j - π^T A_j). Every node also needs a copy of -π before the piv_col 
search begins. In the Gauss-Jordan elimination, -π is functionally 
another row of B^{-1}. Also, a node participating in the piv_row search 
needs elements of A_piv_col and b from each row it is assigned to search. 
These observations will facilitate the B^{-1} discussion that follows. 


Storing the B^{-1} matrix (m by m) generally requires more memory 
than the tableau, and in sequential computation the Gauss-Jordan 
elimination step on the matrix is the most expensive step. Hence a 
parallel implementation should partition B^{-1} among the processors to 
parallelize the Gauss-Jordan elimination. The natural ways of 
partitioning this matrix (which is also used in the B^{-1} A_j 
calculation of the new pivot column) are by rows or by columns. Unlike 
in the simplex algorithm, however, partitioning by rows is clearly better 
than partitioning by columns. A step-by-step comparison will make this 
apparent. For the piv_col step, assume column partitioning of B^{-1}. Then 
-π must be distributed among the nodes, since the update of -π_i depends 
upon elements in the i-th column of B^{-1}. Hence -π needs to be gathered 
and sent to every node (similar communication complexity to global 
minimization) before each step to find piv_col. If B^{-1} is row partitioned, 
the entire -π row can reside on one node (necessitating a global send of 
-π during each iteration), or on every node (no communication for -π 
needed). Either of these row partition options for the -π vector is 
superior to the column partition’s option. 


In the piv_row step with column-partitioned B^{-1}, the product 
B^{-1} A_piv_col produces m partial sums which must then be 
combined through a global sum communication. In contrast, row 
partitioning allows B^{-1} A_piv_col to produce complete sums for those rows 
of B^{-1} that are on that node. In addition, if the b elements of the same 
row are present, a local piv_row minimum can be found with no 
communication. 
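
The difference between the two partitionings of B^{-1} can be seen in a small sketch. The matrix, vector, and function names below are hypothetical, and plain Python lists stand in for the per-node data.

```python
def row_partition_product(b_inv_rows, a_col):
    """A node holding some rows of B^{-1}: each dot product is a finished entry."""
    return [sum(x * y for x, y in zip(row, a_col)) for row in b_inv_rows]

def column_partition_partial(b_inv, cols, a_col):
    """A node holding columns `cols` of B^{-1}: m partial sums, one per row."""
    return [sum(row[j] * a_col[j] for j in cols) for row in b_inv]

b_inv = [[1.0, 2.0],
         [3.0, 4.0]]                   # hypothetical 2 x 2 B^{-1}
a_col = [5.0, 6.0]                     # hypothetical A_piv_col

# Row partition: node 0 holds row 0, node 1 holds row 1; no combining needed.
by_rows = row_partition_product(b_inv[0:1], a_col) + \
          row_partition_product(b_inv[1:2], a_col)

# Column partition: each node produces m partial sums ...
p0 = column_partition_partial(b_inv, [0], a_col)
p1 = column_partition_partial(b_inv, [1], a_col)
# ... which must be added together (the global sum communication).
by_cols = [x + y for x, y in zip(p0, p1)]
```

Both routes give the same vector, but only the column partition needs the extra element-wise sum across nodes.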


The Gauss-Jordan elimination step is similar for both partitions. 
Because of the inherent advantages in finding piv_col and piv_row, the 
B^{-1} row partitioning was chosen for method C, with b partitioned as if it 
were an extra column of the B^{-1} matrix. 


Now the options for the -π distribution will be readdressed 
(maintain -π at one node or at every node). It was found in simplex 
method A that the b column should reside on only one node and be sent to 
the node possessing A_piv_col. In contrast, the -π vector is needed by 
every node during every iteration, requiring a global send if -π is 
maintained by only one node. Benchmark tests showed that it was less 
expensive to maintain -π through Gauss-Jordan elimination on every 
node than to perform this global send. The data partitioning for method 
C is shown in Figure 8. Figure 9 demonstrates that -π should be 
maintained by every node. 


4.1. Overlapped Communication and Computation 


There is potential for overlapping the communication required in 
each new iteration’s piv_col global minimization step with the 
Gauss-Jordan elimination step that terminates the previous iteration. 
Overlapping these two procedures can reduce the waiting and 
synchronization time that is inevitable in the global minimization. The 
Gauss-Jordan elimination step can compute the new c^T row (for simplex) 
or the new -π row (for revised simplex) before any other rows. With 
this information the locally minimum piv_col can be found at each node, 
and the global minimization communication steps can begin. Instead of 
waiting for sends or receives to complete, the processors perform the 
Gauss-Jordan elimination, checking the status of pending messages 
before starting the elimination step on a new row. When a status check 
reveals that a message is ready to be sent or received, the processor 
handles the communication step immediately, and then resumes the 
Gauss-Jordan step. This communication overlap was implemented for 
simplex method B and revised simplex method C (and method D to be 
discussed shortly). A modest increase in speedup was observed, but not 
as much as expected (only about 2 to 3% for large problems). This is 
partly because the computation processor on the Intel hypercube is 
utilized for much of the sending and receiving work. It is expected that 
overlaps of this type will have a greater effect on hypercubes with more 
separation of computation and communication hardware. 


[Figure 8 annotation: sparse tableau partitioning for method D; in 
method C, the entire A matrix and objective row reside on each node] 

Figure 8. Data partitioning for the revised simplex methods C & D 
(for m = 4, n = 7, and 4 processors) 

Figure 9. Speedups for revised simplex method C 
with two different -π distributions 


4.2. Revised Simplex with Partitioned A Tableau 


For the problem sizes tested and the small number of nodes on the 
hypercubes used in this study, storing the entire sparse A tableau at each 
node was no more of a problem than storing the partitioned B^{-1} matrix. 
However, as the number of nodes increases, the B^{-1} matrix partition at 
each node becomes smaller, while the A tableau storage remains 
constant. So, for larger problems and large numbers of nodes, it may not 
be practical to keep the entire A matrix at each processor. Therefore a 
solution (method D) was investigated that involved a column partitioned 
A matrix (see Figure 8). The major change required is an extra 
communication step after piv_col is found. The node possessing A_piv_col 
globally sends this sparse column to every other node. Figure 10 shows 
the speedup of the original revised simplex method C and the new 
method D for various problem sizes. The difference in speedup for the 
two methods decreases as the problem size increases, because the time 
required for method D’s extra communication is relatively constant while 
the computation time is increasing. In fact, for the problems tested in 
this study, A_piv_col was sparse enough that its message length was always 
under the 1K packet size of the Intel machine, and so the communication 
overhead was fixed. For large problems the change in speedup is 
minimal. 


Figure 10. Performance speedup for revised simplex methods. 


4.3. Comparison of Simplex B and Revised Simplex C Methods 


Despite the more complicated structure of the revised simplex 
algorithm, a method of partitioning the data was found such that 
communication requirements are quite similar to those of the 
row-partitioned simplex method. Each method requires two global 
minimization steps, and each method requires a global send before the 
Gauss-Jordan elimination step can proceed. The global send for the 
simplex case is a full row of the A tableau (length of n), but for the 
revised simplex case a row of the B^{-1} matrix (length of m) is sent. Since 
m is typically less than n, the revised simplex method requires smaller 
messages. Figure 11 compares execution times of methods B and C for 
several optimization problems. Execution time is used instead of 
speedup since the basic sequential algorithms for the two methods are 
different. 


The key observation to make from Figure 11 is that the method 
(simplex or revised simplex) that performs fastest sequentially will also 
perform fastest in parallel. This is a consequence of the similar 
communication structure, which makes it easy to predict relative 
performance of the parallel algorithms. For most larger problems, 
m < n - m, which favors the revised simplex method. A large 
difference in execution time is seen for "scsd1", for which m << n. 
The "share2b" problem has m > n - m, which favors the standard simplex. 


Figure 11. Execution times for method B and revised method C 
(Execution times plotted on log scale) 


5. SUMMARY 


Several partitioning and communication strategies were explored 
for executing the simplex and revised simplex algorithms on a hypercube. 
Column and row partitionings were compared for the simplex algorithm, 
and the row partitioning method was found to be generally superior. 
Column partitioning is more efficient when the number of rows m is 
small, and the number of columns n is much greater than m. 


Although revised simplex is a more complex sequential algorithm 
than simplex, it was found that by partitioning the B^{-1} matrix by rows 
and maintaining both the -π vector and the static A matrix on every 
processing node, communication costs could be kept roughly equivalent 
to those of the row-partitioned simplex. Comparisons made between 
actual execution times of these simplex and revised simplex algorithms 
showed that whichever algorithm performed better sequentially also 
performed faster in parallel. If m < n - m, the revised simplex version 
should execute faster. 


A revised simplex algorithm that partitioned the A matrix by 
columns among the processors was also investigated. This partitioning 
reduces the memory requirement for each node but necessitates an extra 
message during each iteration. This message is generally small and 
showed very little impact on performance for the larger problems we 
tested. 


Although in theory the simplex and revised simplex algorithm 
calculations can be almost completely parallelized, communication costs 
are a significant factor in the actual iPSC execution time. This is 
primarily a result of the two global minimizations. Identical calculations 
done by each node, such as maintaining -π in the revised simplex 
method, also contribute to a drop in efficiency as the number of nodes 
increases. Options were explored for optimizing the communication 
required for obtaining global minima. The best strategy is to use an 
"inverse" global send communication pattern to collect the global 
minimum at one node and then send the answer to every other node. 


In summary, the revised simplex algorithms presented seem 
superior to the simplex algorithms. The revised simplex method with a 
partitioned A matrix provides the best alternative for practical linear 
optimization of large problems on a hypercube because of its reduced 
memory requirement. The speedups obtained show the feasibility of 
using hypercubes for linear optimization, particularly because many 
practical problems are far larger than those tested here, and because the 
ratio of communication to computation costs was relatively high for the 
Intel iPSC. 


Abstract 


We present a multiprocessor realization for sequential 
dynamic programming problems defined on a state space 
which is represented by a directed acyclic graph. Our 
approach applies, in particular, to problems in which an 
upper and lower bound on the (initially unknown) cost 
can be determined for each node in the search graph as 
soon as it is generated, a framework often referred to as 
an “informed model”. We demonstrate how a recursive 
relationship between the bounds on successive states 
can be exploited to develop a technique in which state 
space generation and pruning are carried out in an 
asynchronous homogeneous manner on a loosely coupled 
architecture. The process is controlled by a message 
passing scheme utilizing three basic message types for 
(1) generating and (2) pruning nodes, and (3) backing 
up costs. Issues related to the design of the messages, 
correctness of the approach, and potential problems 
created by race conditions and deadlock are discussed. 
Results obtained in the context of a large scale sequential 
(Markov) decision problem are presented which indicate 
that near 100% efficiency in the use of processors can 
be achieved. Because many common problems in game 
playing, combinatorial optimization, and discrete state 
optimal control can be adapted to an informed model 
framework, the techniques presented here are quite 
general in their potential application. 


1 Introduction 


Background. The techniques presented here apply to 
any member of a broad class of discrete-state discrete-time 
sequential decision problems defined on a directed 
acyclic graph, referred to as a decision graph. The 
decision graph is an explicit representation of state space 
and possible state transitions. It has a node associated 
with each state in the system and arcs from a given 
node N_1 to another node N_2 if there exists a means by 
which the system can transit from the state associated 
with node N_1 to the state associated with node N_2. A 
certain subset of the set of all nodes in the system are 
distinguished as decision nodes from which a controller 
has the ability to influence the next state to which the 
system will progress. Other nodes (outcome nodes) 
represent states in which an opponent or nature has control 
over the next state transition. 

There is a real-valued cost c(N) associated with each 
node N. The cost associated with each leaf or terminal 
node is assumed to be directly computable. The cost 
associated with a non-terminal node, N, is given by 
a (recursive) function of the costs associated with the 
set, SUCC(N) = {N_1, N_2, ..., N_r}, of successors of N. 
For example, if the objective is to minimize the overall 
cost¹, then the cost of each decision node is given by: 
c(N) = min{c(N_1), c(N_2), ..., c(N_r)}. 

Minimax² [1] and discrete space and time Markov 
[2,3] decision processes are perhaps the most prevalent 
examples of sequential decision processes, although, in 
the most abstract sense, any combinatorial optimization 
problem, e.g. an integer program solved using implicit 
enumeration (branch and bound), could also be considered 
a form of polyadic sequential decision making in 
which all nodes are decision nodes. For a Markov decision 
process the cost associated with an outcome node is 
given by the weighted average: c(N) = \sum_{i=1}^{r} p_i c(N_i), 
where p_i is the probability of making the transition from 
state N to state N_i. For a minimax decision process, 
the cost associated with an outcome node is given as 
c(N) = max{c(N_1), c(N_2), ..., c(N_r)}. 
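
As an illustration of these cost definitions, the sketch below evaluates a tiny Markov decision graph by direct recursion. The graph, node names, and probabilities are made up for the example; memoization is used because a node in a directed acyclic graph can be reached from more than one predecessor.

```python
from functools import lru_cache

# Hypothetical decision graph: node -> (kind, payload).
GRAPH = {
    "root": ("decision", ["a", "b"]),
    "a": ("outcome", [(0.5, "t1"), (0.5, "t2")]),   # (p_i, successor) pairs
    "b": ("outcome", [(0.9, "t2"), (0.1, "t3")]),   # "t2" has two predecessors
    "t1": ("terminal", 4.0),
    "t2": ("terminal", 10.0),
    "t3": ("terminal", 0.0),
}

@lru_cache(maxsize=None)
def cost(node):
    """c(N): direct for terminals, min at decision nodes, mean at outcome nodes."""
    kind, payload = GRAPH[node]
    if kind == "terminal":
        return payload
    if kind == "decision":
        return min(cost(s) for s in payload)
    return sum(p * cost(s) for p, s in payload)

root_cost = cost("root")
```

For a minimax process under the convention used here, the outcome rule would take `max` over the successors instead of the weighted sum.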

We will assume here that the decision graph is finite, 
and that there exists a root node which corresponds to 
a unique starting state for the process. A generative 
dynamic programming algorithm is used to determine 
what action to apply at every decision point in order 
to minimize total cost. The process [1] involves an 
expansion queue (or “open” list) to which a mechanism 
for generating the state of all successors of a node is 
applied according to some strategy to generate the 
decision graph³. The part of the decision graph that has, 
at a given point in time, already been generated is called 
the search graph. The optimal decision graph is a 
subgraph of the decision graph generated from the root 
node and all optimal decisions from that point. 

¹ As will be assumed, without loss of generality, in all further 
discussion. 

² In most discussions of minimax decision processes, decision 
nodes are maximizing nodes. In all references to minimax in 
the sequel, we assume the opposite in order to maintain 
uniformity with discussions on Markov decision processes and our 
application as discussed in section 4. Extension to the case in 
which decision nodes are maximizing should be obvious. 
Similar extensions can easily be made to the equivalent negamax 
and negamin formulations. 

Informed Models. In an informed model [4] each 
node, N, in the state space is assigned an upper bound, 
U(N), and a lower bound, L(N), on its actual cost 
c(N) as soon as it is generated. We require that 
U(N) >= c(N) >= L(N) by definition. When U(N) = 
L(N) = c(N) the node N is said to be fathomed. Finally, 
we will assume that U(N) = L(N) = c(N) for 
any terminal node N. 

Recursive formulation of bounds. We assume that 
nodes and bounds are generated and updated at discrete 
points in time. We denote the “current” upper, lower, 
and cost values for node N at time t by U^t(N), L^t(N), 
and c(N). There is a recursive relationship between the 
updated bounds for node N and the bounds of its offspring. 
Assume the cost c(N) is given as some function (depending 
on the node type) f(N_1, N_2, ..., N_r) of the successors, 
N_1, N_2, ..., N_r, of N. Then we have [4]: 

    U^{t+1}(N) = \min\{ U^t(N),\ f(U^t(N_1), U^t(N_2), \ldots, U^t(N_r)) \}, 

and, 

    L^{t+1}(N) = \max\{ L^t(N),\ f(L^t(N_1), L^t(N_2), \ldots, L^t(N_r)) \}. 

For the Markov decision process we have, 

    U^{t+1}(N) = \min\{ U^t(N),\ \min_{N_i \in SUCC(N)} U^t(N_i) \},    (1) 

and, 

    L^{t+1}(N) = \max\{ L^t(N),\ \min_{N_i \in SUCC(N)} L^t(N_i) \},    (2) 

if N is a decision node, and, 

    U^{t+1}(N) = \min\{ U^t(N),\ \sum_{N_i \in SUCC(N)} p_i U^t(N_i) \}, 

and, 

    L^{t+1}(N) = \max\{ L^t(N),\ \sum_{N_i \in SUCC(N)} p_i L^t(N_i) \}, 

if N is an outcome node. 

For a minimax decision process, equations 1-2 give 
the recursive formulation at a decision (in our case, 
minimizing) node, and we have 

    U^{t+1}(N) = \min\{ U^t(N),\ \max_{N_i \in SUCC(N)} U^t(N_i) \},    (3) 

and, 

    L^{t+1}(N) = \max\{ L^t(N),\ \max_{N_i \in SUCC(N)} L^t(N_i) \},    (4) 

if N is an outcome node. 

³ The process of expanding a node is assumed here to be 
atomic, although a more general case can be considered. 
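
The bound updates in equations 1-4 and the Markov outcome-node rule can be collected into one small function. This is a sketch only; the function name, argument layout, and the `minimax` flag are assumptions made for the example.

```python
def update_bounds(kind, upper, lower, succ_bounds, probs=None, minimax=False):
    """Return (U^{t+1}(N), L^{t+1}(N)) for one node.

    kind:        "decision" or "outcome"
    upper/lower: current U^t(N) and L^t(N)
    succ_bounds: list of (U^t(N_i), L^t(N_i)) for the successors
    probs:       transition probabilities p_i (Markov outcome nodes only)
    minimax:     if True, outcome nodes maximize instead of averaging
    """
    ups = [u for u, _ in succ_bounds]
    lows = [l for _, l in succ_bounds]
    if kind == "decision":             # equations (1) and (2)
        f_up, f_low = min(ups), min(lows)
    elif minimax:                      # equations (3) and (4)
        f_up, f_low = max(ups), max(lows)
    else:                              # Markov outcome node: weighted average
        f_up = sum(p * u for p, u in zip(probs, ups))
        f_low = sum(p * l for p, l in zip(probs, lows))
    # The upper bound never increases and the lower bound never decreases.
    return min(upper, f_up), max(lower, f_low)

succ = [(8.0, 2.0), (6.0, 4.0)]        # hypothetical successor bounds
decision = update_bounds("decision", 10.0, 0.0, succ)
markov = update_bounds("outcome", 10.0, 0.0, succ, probs=[0.5, 0.5])
maximizing = update_bounds("outcome", 10.0, 0.0, succ, minimax=True)
```

Starting from `float("inf")` and `float("-inf")` as the current bounds also works, since `min` and `max` then simply adopt the values backed up from the successors.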

Pruning. The upper bound is a non-increasing, and 
the lower bound is a non-decreasing, function of t. Thus 
we can use the bounds associated with nodes in an 
informed model to curtail generation of the entire state 
space. Let N be a decision node, and assume the 
process generates a tree. If L^t(N_i) > U^t(N_j) for 
N_i, N_j ∈ SUCC(N), the node N_i can be “removed 
from further consideration”, since no optimal decision 
tree will contain it. For minimax decision processes, a 
similar relationship, i.e., U(N_i) < L(N_j), can be used 
to prune the successor N_i from an outcome node N, 
provided that we assume that our “opponent” is working 
with the same objective function. In contrast, in a 
directed graph, a node may have more than one 
predecessor. As such, it cannot be assumed that once a node 
becomes pruned (or “deactivated”) it will not become 
the offspring of some node (hence, “activated”), 
possibly in the optimal decision graph, at some future 
point in time. This has significant implications for the 
design of our approach, as discussed below. 
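
The two pruning tests can be written as short predicates over the successors' current bounds. The sketch below is illustrative: the function names and sample bounds are hypothetical, and it only identifies candidates, since in a directed graph a pruned node may later be reactivated by another predecessor.

```python
def prunable_at_decision(bounds):
    """Successors N_i of a minimizing node with L(N_i) > U(N_j) for some N_j.

    bounds: dict mapping successor name -> (U, L).
    """
    best_upper = min(u for u, _ in bounds.values())
    return sorted(n for n, (_, l) in bounds.items() if l > best_upper)

def prunable_at_minimax_outcome(bounds):
    """Successors N_i of a maximizing node with U(N_i) < L(N_j) for some N_j."""
    best_lower = max(l for _, l in bounds.values())
    return sorted(n for n, (u, _) in bounds.items() if u < best_lower)

bounds = {"a": (5.0, 3.0), "b": (9.0, 6.0), "c": (7.0, 4.0)}  # (U, L) pairs
at_decision = prunable_at_decision(bounds)
at_outcome = prunable_at_minimax_outcome(bounds)
```

Comparing against the single best bound is enough, because a successor satisfies the condition for some N_j exactly when it satisfies it for the N_j with the smallest upper (or largest lower) bound.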

Relation to other work. The process discussed above 
defined on informed models provides a generalization 
for most common search strategies, because we can 
always assign U(N) = +∞ and L(N) = -∞ for any 
non-terminal node N when it is first generated. In 
this case, for the minimax decision processes, if bounds 
are backed up according to equations 1-4 and pruning 
is implemented as discussed above, the familiar 
alpha-beta pruning strategy results⁴. (Refer to [5,6,7,8] for 
other discussions on parallel implementations of 
alpha-beta pruning). For branch and bound enumeration of 
integer programs we can assign as an upper bound (in 
a minimization problem) the objective value of any 
incumbent solution or the objective value of any feasible 
completion of a given partial solution as generated for 
example by a “greedy” algorithm. Linear programming 
relaxations are often used to generate lower bounds. 

⁴ Certain aspects of alpha-beta, particularly deep cutoffs, 
require minor extensions to the message passing scheme proposed 
here. 

Parallel search algorithms have received treatment in 
a broad range of contexts [9,10,11,12], although, for 
the most part, the assumption is that the underlying 
process generates a tree. Hence the focus of attention 
tends to be on issues that differ from those discussed 
here. Most notable are the problems of mapping nodes 
to the interconnection topology of the processors in a 
manner that preserves adjacency, and of focusing processor 
resources on specific areas of the search tree. In [6,7,12] 
issues related to anomalies that occur in parallel search 
are discussed. Often, search processes that endeavor 
only to find a feasible solution (e.g. a “path” through 
a state space from a start state to a goal, or a proof 
tree for a predicate logic theorem) will be represented 
as numerical optimization problems in order to guide the 
search process itself [13]. In this case, the techniques 
discussed here are also applicable. 

Search algorithms defined on informed models for 
specific problem areas have received some study, 
particularly among the AI community. Most notably, Berliner’s 
B* [14] and descendant algorithms [15,16,17,18] show 
performance gains over earlier alpha-beta variants. 
Recently, Ibaraki, et al. [4] have generalized the analysis 
of informed models, and introduced the algorithm H* 
which is a member of a class of algorithms that 
typically outperform alpha-beta. We are aware of no other 
work treating the parallel implementation of search 
processes defined on informed models, in particular, when 
the state space for the process is a directed acyclic 
graph, which is the most general case. 

Outline. In the sequel, section 2 defines the message 
passing approach, section 3 contains a discussion on the 
correct operation of the procedure we have defined, and 
section 4 covers results derived from an implementation 
of our approach. A synopsis is given in section 5. 


2 Approach 


In our approach, a hashing function, applied, e.g., to the 
binary representation of the state vector for the node, 
is used to determine which processor is responsible for 
a node once it is generated. Clearly it would be more 
desirable to map nodes participating in a 
predecessor-successor relationship to adjacent processors. But 
because each node may have more than one predecessor, 
this is difficult to do in general without very specific 
knowledge of the topology of the decision graph. In 
order to realize the distributed algorithm, three basic 
message types are required, corresponding to the basic 
actions required in the state space generation and 
pruning. They are: (1) add predecessor link, (2) remove 
predecessor link, and (3) re-evaluate bounds. 
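
A minimal sketch of the node-to-processor mapping follows. The text does not say which hashing function was used, so the choice of SHA-256 over a textual encoding of the state vector is purely an assumption for illustration, as is the function name.

```python
import hashlib

def owner(state, num_processors):
    """Map a state vector (tuple of ints) to the processor responsible for it."""
    data = ",".join(str(x) for x in state).encode("ascii")
    digest = hashlib.sha256(data).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo P.
    return int.from_bytes(digest[:8], "big") % num_processors

proc = owner((1, 0, 1, 1), 8)          # hypothetical state vector, 8 processors
```

Because the mapping depends only on the state, every processor that generates the same successor sends its messages to the same owner, whichever predecessor produced it.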


ADD PREDECESSOR 

    IF the target node has not been created THEN 
        create the node with predecessor link to sender 
        and put node on the expansion queue 
    ELSE 
        add predecessor link back to sender 
        IF the offspring node is marked inactive THEN 
            mark the node as active 
            IF the node has not been expanded THEN 
                put node on expansion queue 
            ELSE 
                send add predecessor messages 
                to all successors 


Figure 1: THE ALGORITHM FOR THE ADD 
PREDECESSOR MESSAGE HANDLER. 


The ADD PREDECESSOR message. The purpose 
of the add predecessor message is to (re-)establish a 
link from a successor (offspring) node to a parent 
(predecessor). The receipt of an add predecessor message 
indicates to the offspring node that there is a predecessor 
in the decision graph that (1) is in the current search 
graph, and (2) requires the minimal expected cost (as 
well as the current upper and lower bounds) associated 
with the offspring node in its current and future 
computations. As such, if the offspring node is inactive at the 
time the message is received, it must re-activate itself in 
order to resume the on-going process of generating more 
and more refined upper and lower bounds, leading 
eventually to a fathomed node. If the node has not yet been 
created, then the host processor must create the target 
node, generate its upper and lower bounds, and put it 
on the expansion queue. The algorithm for handling an 
add predecessor message is given in figure 1. 


The REMOVE PREDECESSOR message. A re- 
move predecessor message is sent from a parent (pre- 
decessor) node to an offspring (successor) node when 
the parent node no longer needs current and updated 
state information from the offspring node in calculating 
its current state. This can occur for two reasons: (1) 
the parent node has become inactive, or (2) the branch 
of the search graph rooted at the offspring node has 
become pruned. 


Upon receipt of a remove predecessor message, the
offspring node removes the link back to the (sending)
predecessor node. If the offspring node remains active
(i.e., it still has predecessors in the current search graph)
then no further action is required. However, removal of
the link back to the parent node may reduce message
traffic by removing a potential channel along which
re-evaluate bounds messages can pass. Otherwise, if the
list of predecessor links becomes empty as a result of the
removal of the link, then the node becomes inactive. It
removes itself from the expansion queue if it is on it,
or else it sends remove predecessor messages to all of
its offspring. The algorithm used by the handler for a
remove predecessor message is given in figure 2.


REMOVE PREDECESSOR


Remove link to predecessor node

IF list of predecessor nodes is now empty THEN
    IF the node is on the expansion queue THEN
        remove the node from the expansion queue
    ELSE
        send remove predecessor messages to all offspring


Figure 2: THE ALGORITHM FOR THE REMOVE PREDECESSOR
MESSAGE HANDLER.
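
The handlers of figures 1 and 2 can be sketched in a few lines. The sketch below is an illustration under simplifying assumptions, not the code used in our experiments: a single Processor object stands in for all processors, messages are delivered by direct method calls, bounds are omitted, and the class and method names are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.predecessors = set()
        self.active = True
        self.expanded = False


class Processor:
    """Single-processor stand-in; a message is a direct method call."""

    def __init__(self, successors):
        self.successors = successors      # node name -> successor names
        self.nodes = {}
        self.expansion_queue = deque()

    def add_predecessor(self, target, sender):
        node = self.nodes.get(target)
        if node is None:                  # target not yet created
            node = self.nodes[target] = Node(target)
            node.predecessors.add(sender)
            self.expansion_queue.append(target)
            return
        node.predecessors.add(sender)
        if not node.active:               # re-activate the node
            node.active = True
            if not node.expanded:
                self.expansion_queue.append(target)
            else:
                for s in self.successors[target]:
                    self.add_predecessor(s, target)

    def remove_predecessor(self, target, sender):
        node = self.nodes[target]
        node.predecessors.discard(sender)
        if not node.predecessors:         # node becomes inactive
            node.active = False
            if target in self.expansion_queue:
                self.expansion_queue.remove(target)
            else:
                for s in self.successors[target]:
                    self.remove_predecessor(s, target)

    def expand_next(self):
        target = self.expansion_queue.popleft()
        self.nodes[target].expanded = True
        for s in self.successors[target]:
            self.add_predecessor(s, target)


# A diamond-shaped graph: node c has two predecessors, a and b.
graph = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
proc = Processor(graph)
proc.add_predecessor("r", "env")   # "env" is a placeholder sender for the root
while proc.expansion_queue:
    proc.expand_next()
proc.remove_predecessor("a", "r")  # c stays active: b still links to it
proc.remove_predecessor("b", "r")  # c loses its last predecessor
```

After the two removals node c is inactive. A later add predecessor message to a re-activates a and, because a has already been expanded, is forwarded to c, which re-activates in turn.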

The RE-EVALUATE BOUNDS message. The re- 
evaluate bounds message indicates to the receiving (par- 
ent) node that the bounds of the sending (offspring) 
node have changed, and hence it is appropriate for the 
parent to recalculate its bounds according to the re- 
cursive formulation given in section 1. The re-evaluate 
bounds message is sent when the bounds of an offspring 
node have changed. This can occur either if the off- 
spring node itself has received a re-evaluate bounds mes- 
sage, or when the bounds are set for the first time (i.e. 
when its data structure is established). The algorithm 
used by the handler for a re-evaluate bounds message is
given in figure 3. 


3 Correctness of the Approach 


Several potential problems exist with processes defined
on graphs that do not occur in the context of problems
defined on a tree. Observing the message behavior
around a typical node reveals that it may be generated,
deactivated, and reactivated several times during the
course of expanding the decision graph. We must show,
therefore, that when (and if) the process terminates, any
node in an optimal decision tree is active. Furthermore,
we must show that there is no possibility for “infinite
message loops”, caused by a sequence of message
initiations that eventually “loop back” to the original node
in the sequence, thereby resulting in an ever-increasing
flow of message traffic. In the following discussion we
show that the message passing scheme proposed here is
guaranteed to produce the correct result, and to terminate
in a finite amount of time, assuming the decision graph
is finite⁵.


RE-EVALUATE BOUNDS


Re-calculate the upper and lower bounds according to
the recursive formulation for the receiving node

IF the bounds have changed THEN
    send re-evaluate bounds messages to all
    predecessors

IF one or more successors can be pruned THEN
    send remove predecessor messages to those
    successors

IF the upper bound is equal to the lower bound THEN
    mark the node as fathomed


Figure 3: THE ALGORITHM FOR THE RE-EVALUATE BOUNDS
MESSAGE HANDLER.

In the sequel, the following strategy will be used. First 
we define the concept of the state of the system at a 
given time t. We then show that the system has a finite 
number of states, provided the number of nodes in the 
decision graph is finite. Next we show that the state
transition diagram is acyclic. From these two results it can
be concluded that the algorithm will terminate, provided
that the number of message initiations is finite. We will
then show that at termination, all nodes on the optimal
decision graph are fathomed, and hence the root node is
fathomed. Finally, we show that the number of message
initiations is always finite. This is done by first showing
that the result holds when the process generates a tree, 
and then extending this result to the case where the 
process generates a graph. 

⁵ Many of the arguments provided here are given in outline
form only due to space limitations. Full details can be found
in [19].

Definitions. The state of a node $N$ at a point in time
$t$ will be given as


    $S^t(N) = (U^t(N), L^t(N))$


where $U^t(N)$ (respectively, $L^t(N)$) is the upper
(respectively, lower) bound of node $N$ at time $t$. If node
$N$ has not been generated (i.e., is not in the current
search graph) at time $t$, then we set $U^t(N) = +\infty$
and $L^t(N) = -\infty$. The distance between two nodes
in a directed graph is the length of the shortest (in this
case, directed) path between the two nodes. The depth,
$d(N)$, of a node $N$ in a directed acyclic graph is the
distance from the root node to $N$. The separation of two
nodes is the length of the longest directed path between
them. The sound, $\sigma(N)$, of a node $N$ is the separation
between the root node and $N$.
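
The difference between depth (shortest directed path from the root) and sound (longest directed path from the root) can be made concrete with a small sketch. It assumes the graph is given as a dictionary of successor lists and that every node is reachable from the root; the function name depth_and_sound is hypothetical.

```python
from collections import deque


def depth_and_sound(successors, root):
    """Return (depth, sound) dictionaries for a rooted DAG."""
    # Depth: breadth-first search gives shortest path lengths.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        n = queue.popleft()
        for s in successors[n]:
            if s not in depth:
                depth[s] = depth[n] + 1
                queue.append(s)

    # Sound: longest path lengths, relaxed in topological order.
    indegree = {n: 0 for n in depth}
    for n in depth:
        for s in successors[n]:
            indegree[s] += 1
    sound = {n: 0 for n in depth}
    ready = deque([root])
    while ready:
        n = ready.popleft()
        for s in successors[n]:
            sound[s] = max(sound[s], sound[n] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:      # all predecessors processed
                ready.append(s)
    return depth, sound


# Node c is reached directly from r and also through a and b.
example = {"r": ["a", "c"], "a": ["b"], "b": ["c"], "c": []}
depth, sound = depth_and_sound(example, "r")
```

In this example node c has depth 1 but sound 3. In a tree the two quantities coincide; they differ only when a node has several predecessors, which is why the inductive proofs below are stated in terms of the sound.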


3.1 Finite Termination 


Lemma 3.1. The number of states for any node in the 
system is finite. 

Proof: The proof proceeds by induction on the sound
of a node $N$ in the decision graph. The assertion
holds for any leaf node $N_L$ since it can be in only two
states: $S^t(N_L) = (+\infty, -\infty)$ if it is not in the
current search graph, and $S^t(N_L) = (U(N_L), L(N_L))$
otherwise, where $U(N_L)$ and $L(N_L)$ are the bounds
generated by the initial bound generation procedure⁶. Since a
leaf node has no successors, its upper and lower bounds
will never be changed once it is generated. This provides
the basis.

Assume the result holds for all nodes at a sound
greater than or equal to $s$, and let $\sigma(N)$ be $s - 1$.
Note that for any successor $N_i$ of $N$, we must have
$\sigma(N_i) \ge s$, so $N_i$ has a finite number of states by
the inductive hypothesis. The set of states for node $N$ is
contained in a set generated by the application of the
recursive formulation for the costs to the cross product
of the sets of states of its successors. The latter
set is finite by the inductive hypothesis. ∎

Theorem 3.2. The number of states for the system is 
finite. 

Proof: By its definition:


    $\{S^t : 0 \le t < \infty\} \subseteq \prod_{N} \{S^t(N) : 0 \le t < \infty\}.$


Every factor in the cross product is finite by lemma
3.1, and there is a finite number of nodes in the decision
graph. ∎

Theorem 3.3. The state transition diagram for the 
process is acyclic. 

Proof: The upper bound for any node is a non- 
increasing function, and the lower bound for any node 
is a non-decreasing function of t. Hence each element 
in the sequence of states through which a given node 
passes is unique. ∎

Because the number of states is finite and the state 
transition diagram is acyclic, we have the following. 

Theorem 3.4. Given that the number of nodes in the 
decision graph is finite, then the system will reach a 
terminal state in finite time, provided that the number 
of message initiations is finite. ∎


⁶ As noted above, it is reasonable to assume that
$U(N_L) = L(N_L) = e(N_L)$ for any leaf node $N_L$, that is,
that leaf nodes are fathomed upon generation. This is not
required for this result, however.

Next we wish to show that when the system reaches
a terminal state, the root node is fathomed. That is,
the upper bound for the root node is equal to the lower
bound. We first show that the optimal decision
graph is always generated.


Theorem 3.5. Let $N_1$ and $N_2$ be two nodes in the
search graph such that $N_1$ is a parent of $N_2$ and that
(1) $N_1$ and $N_2$ are both in an optimal decision graph, (2)
there is a directed path in the current search graph from
the root node to $N_1$ that is contained entirely within an
optimal decision graph, and (3) there is an arc from $N_1$
to $N_2$ in the current search graph. Then the arc from
$N_1$ to $N_2$ will never be removed and the node $N_2$ will
never become inactive.


Comment: This result can be established [19] by
induction on the depth of a node in the search graph. ∎


Corollary 3.6. If the node $N_2$ in the conditions for
theorem 3.5 is on the expansion queue, it will never be 
removed from the expansion queue. 


Theorem 3.7. At any point in the operation of the 
distributed algorithm, as above, either (1) there is at 
least one processor, with at least one node in an optimal 
decision graph on its expansion queue, or (2) the optimal 
decision graph is generated entirely in the current search 
graph. 

Comment: This result can be established by a simple
inductive argument on the sequence $T = t_0, t_1, \ldots, t_n, \ldots$
of times at which the system changes state. ∎


Theorem 3.8. The optimal decision tree is always 
generated in its entirety by the distributed algorithm. 


Proof: This result follows directly from corollary 3.6 
and theorem 3.7. ∎


Theorem 3.9. Upon termination, every node in the 
optimal decision graph is fathomed. 


Proof: The proof is again by induction on the
separation from the root node to a particular node N in the
optimal decision graph. For any leaf node, the result 
follows directly from theorem 3.8, that is, when the leaf 
node is generated it will be fathomed according to our 
basic assumptions as stated above. 


For a node $N$ at a sound $s - 1$, consider the list
$N_1, N_2, \ldots, N_k$ of its successor nodes. Let $N_i$ be the
last of these successors to become fathomed. When that
happens, a re-evaluate bounds message is sent to each of
its predecessors, including node $N$. Upon receipt of the
re-evaluate bounds message, the node $N$ will re-evaluate
its bounds. Since all of its offspring have been fathomed
at this point, the node $N$ itself becomes fathomed. ∎


This result shows, in particular, that the root node is
fathomed when the system reaches a terminal state. 


3.2 Finite Message Initiation 


Most of the discussion presented above has been
predicated upon the assumption that the number of message
initiations is finite. We now show that this holds.

Theorem 3.10. In the case where the process gener- 
ates a search tree, the number of messages of each type 
that are generated is finite. 


Proof: In a search tree, each node will have at most 
one predecessor (parent). Because the upper bounds 
are non-increasing, and the lower bounds are non- 
decreasing, once a node becomes inactive in a search 
tree, it remains inactive for all future time. Hence, a 
node is added and removed from the parent list of an 
offspring node at most once. As such, there can be 
no more than one add predecessor and one remove
predecessor message per node. Since, by assumption, the
number of nodes is finite, the number of add and remove
predecessor messages is finite.

A re-evaluate bounds message is sent either as a result 
of a state change or in response to an add-predecessor 
message. The number of states is finite, as is the
number of add predecessor messages. Hence, the number
of re-evaluate bounds messages that are initiated
is finite. ∎

Theorem 3.11. In the case where the process gener- 
ates a decision graph, the number of message initiations 
of each type is finite. 

Outline of Proof: Any directed acyclic graph can be 
“unwound” to a tree which is in some sense “equiva- 
lent” to the original graph. This is achieved by duplicat- 
ing nodes with multiple parents, giving each parent an 
equivalent copy of the original node. This process pro- 
ceeds recursively, starting at the root node and working 
down the graph until no nodes with multiple parents
remain. Because the state vectors are duplicated when the
nodes are, the decision process defined on the equivalent
tree is the same as that defined on the original graph, in
the sense that (1) the optimal cost associated with each
node in the graph is the same as the optimal cost
associated with the equivalent nodes in the decision tree,
and (2) the optimal decision graph maps to the optimal
decision tree.
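
The unwinding construction can be illustrated with a short sketch. It assumes the graph is a dictionary of successor lists and labels each copy of a node by its path from the root, so that a node with several parents receives one copy per distinct path; the function names unwind and count_nodes are hypothetical.

```python
def unwind(successors, node, path=()):
    """Unwind a rooted DAG into a tree of (label, children) pairs.

    Each tree node is labelled with the path from the root that reaches
    it, so a graph node with multiple parents is duplicated once per
    distinct path.
    """
    label = path + (node,)
    children = [unwind(successors, s, label) for s in successors[node]]
    return (label, children)


def count_nodes(tree):
    """Count the nodes of a tree built by unwind."""
    return 1 + sum(count_nodes(child) for child in tree[1])


# A diamond: c has two parents, so it appears twice in the tree.
diamond = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
tree = unwind(diamond, "r")
```

The four-node diamond unwinds to a five-node tree. Since the graph is finite and acyclic, the number of root-to-node paths is finite, so the equivalent tree is finite as well, although it may be much larger than the graph.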

In the strategy used to prove this theorem, we show
that the number of message initiations is bounded by
the number of message initiations for the equivalent
search tree. Consider first the add predecessor
messages. Assume that there is a node $N$ which, during
the course of expanding the search graph, receives
two add predecessor messages from the same predecessor
$N_p$ (otherwise, there would be at most one add
predecessor message sent along each arc in the decision
graph, which is finite by assumption). From the
conditions under which the add predecessor message is
initiated (figure 1), it can be shown that for this to happen
there must exist:


(1) a directed path from some ancestor $N_1$ of $N_p$, along
    which at least two sequences of add predecessor
    messages have been transmitted, leading to the two add
    predecessor messages from $N_p$ to $N$, and

(2) two distinct predecessors $N_0$ and $N_0'$ of $N_1$ such
    that an add predecessor message from $N_0$ to $N_1$
    was responsible for generating the first of the two
    sequences, and an add predecessor message from
    $N_0'$ to $N_1$ was responsible for generating the
    second of the two sequences.


Because of the procedure used in generating the
equivalent search tree from the search graph, the two
sequences of add predecessor messages would therefore
occur on distinct branches of the tree, one rooted at
$N_0$ and the other rooted at $N_0'$. Using this fact as a
basis, it can then be shown that there can be at most
one message sequence in the graph for each message
sequence in the corresponding tree, which, from
theorem 3.10, establishes the result for the add predecessor
messages.

The proof for the remove predecessor messages is sim- 
ilar. As in theorem 3.10, the fact that a finite number 
of re-evaluate bounds messages are sent is established 
from the fact that the add and remove predecessor mes- 
sage transmissions are finite and the assumption that the 
state space is finite. ∎

Race Conditions and Deadlock. In general, race 
conditions created by messages between two nodes ar- 
riving out of sequence will not create a problem for the 
operation of the distributed algorithm being discussed 
here, although some precautions must be taken as out- 
lined below. 

First, if messages between two specific nodes are cer- 
tain to arrive in the same order they are sent, then it 
is easy to show [19] that the process will always gen- 
erate the correct result regardless of the order in which 
those messages arrive relative to messages sent by other 
nodes. In practice, it is difficult to guarantee this unless
special precautions are taken in the communication
protocols. Arguments very similar to those used in the proofs
of theorems 3.10 and 3.11 can be used to show that the
order of arrival of re-evaluate bounds messages is still
not important. But problems can occur in sequences of
add and remove predecessor messages.

For example, consider a node $N_1$ and its offspring $N_2$,
and a sequence of

    add predecessor, remove predecessor, add predecessor

which is assumed to be received as

    add predecessor, add predecessor, remove predecessor.


If the second add predecessor message is ignored by
node $N_2$, then upon receipt of the remove predecessor
message, node $N_2$ will remove the link back to $N_1$. Any
subsequent re-evaluate bounds messages will therefore
not be propagated from $N_2$ to $N_1$, and it is possible that
the final result will be incorrect. In our implementation,
these types of problems are managed by maintaining, in
the successor node, a count of the number of add
predecessor messages minus the number of remove
predecessor messages received from a given predecessor. If the
count is zero or negative, the successor considers itself to
be pruned with respect to that predecessor. Otherwise it is
not, and re-evaluate bounds messages are propagated back
to that predecessor.
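
The counting scheme can be sketched as follows. This is an illustration only, with hypothetical names (LinkCounter, naive_final_link); it contrasts the per-predecessor count with a naive handler that treats the link as a simple flag and therefore ignores a duplicate add.

```python
from collections import defaultdict


class LinkCounter:
    """Per-predecessor count of add minus remove messages received."""

    def __init__(self):
        self.count = defaultdict(int)

    def on_add(self, pred):
        self.count[pred] += 1

    def on_remove(self, pred):
        self.count[pred] -= 1

    def linked(self, pred):
        # A zero or negative count means the link is considered removed.
        return self.count[pred] > 0


def naive_final_link(messages):
    """Flag-based handling: a duplicate add has no effect."""
    linked = False
    for m in messages:
        linked = (m == "add")
    return linked


def counted_final_link(messages, pred="N1"):
    counter = LinkCounter()
    for m in messages:
        if m == "add":
            counter.on_add(pred)
        else:
            counter.on_remove(pred)
    return counter.linked(pred)


sent = ["add", "remove", "add"]        # order in which N1 sent the messages
received = ["add", "add", "remove"]    # order in which N2 received them
```

With the flag-based handler the two orderings leave the link in different final states, which is the failure described above. The count ends at one for any arrival order of the same three messages, so the link from the offspring back to its predecessor is retained.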

Deadlock does not occur in the context of the 
scheme presented here because the processor activities 
are driven by the work in the message handling and ex- 
pansion queues, and no message explicitly requires a 
reply before its desired “effect” on the search graph is 
completed. 


4 Results 


The results reported here apply to a large scale resource 
allocation problem modeled as a Markov decision pro- 
cess. The specific application is to a problem in naval 
command and control. The process of solving the prob- 
lem involves expanding a decision graph to enumerate 
all possible ways (in the worst case) in which resources 
can be allocated to tasks over time. The reader is re- 
ferred to [20,21] for more detailed discussions of the ap- 
plication. The message passing scheme discussed above 
was implemented in “C” on a BBN Butterfly parallel
processor⁷. The Butterfly machine consists of processor-
memory units, with memory on any processor accessi- 
ble to any other processor through a delta network as 
described in detail in several recent reports [22,23,24]. 
Although the Butterfly is a tightly coupled machine, our 
implementation simulates a loosely coupled environment 
that has a crossbar interconnection topology. 

In our emulation of a loosely coupled architecture,
each processor's outbound message packets are stored
directly into a buffer on the receiving processor, then
enqueued for processing. Memory access time across
the switch has been reported to be 4 to 7 times slower
than access to local memory, and contention for
message buffer space (controlled by semaphores provided
with the Butterfly’s “Chrysalis” operating system)
created other delays in the transmission of messages.
However, no further delays were introduced into the
simulated “channels”. Although it might be argued that the
results discussed here may not accurately reflect the
situation that would be obtained in a truly loosely coupled
environment, there are good arguments to the contrary
as discussed further below. We are currently implementing
the approach on a network of Transputer⁸ processing
elements, which should provide a resolution to this issue
in the near future.


⁷ Butterfly is a trademark of Bolt, Beranek, and Newman,
Advanced Computers Inc.


There are several different measures for the
performance of parallel algorithms. We will use relative
efficiency of parallelism, $E_p$, as defined in [25], which
gives $E_p = S_p / p$, where $S_p$ is the speedup factor over a
single processor obtained by dividing the elapsed time of
the message passing algorithm on one processor by that
on $p$ processors. We present results related to efficiency
as a function of problem size (number of nodes in the
search graph) and the number of processors in the
system. To this end, experiments were conducted against
a range of problem sizes, by varying the values of the
major input parameters. Data were collected for a range
of problems resulting in search graphs with 6 to 1832
nodes, with the number of processors ranging from 1 to
15. Memory limitations of the system configuration on
which the initial experiments were conducted precluded
the generation of larger search graphs, which would have
been highly desirable⁹. Nonetheless, the trends from the
data that are available are fairly clear and, we believe,
will be consistent with results derived from larger
experiments.


A search graph consisting of 1832 nodes requires 
about 133 seconds to generate on a single Butterfly 
processor unit (a BPN2 module which consists of a 
Motorola 68020/68881 processor), or about .072 sec- 
onds/node. The time required to generate a node in 
the search graph decreases with smaller problems in our 
application, primarily because the size of the state vector 
associated with each node is a function of the problem 
inputs. For example, a search graph consisting of 81 
nodes requires 1.79 seconds on a single processor, or 
about .022 seconds/node. 
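
The per-node figures quoted above follow directly from the reported run times, as the short check below shows. Only the two (nodes, seconds) pairs come from the text; the function name seconds_per_node is hypothetical.

```python
def seconds_per_node(total_seconds, nodes):
    """Average single-processor time to generate one search-graph node."""
    return total_seconds / nodes


large = seconds_per_node(133.0, 1832)   # 1832-node search graph
small = seconds_per_node(1.79, 81)      # 81-node search graph
ratio = large / small                   # per-node cost growth with problem size
```

The per-node cost of the large problem is a little over three times that of the small one, consistent with the state vector growing with the problem inputs.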


Figure 4 shows efficiency as a function of nodes gener- 
ated for various numbers of processors. Figure 5 shows 
efficiency as a function of number of CPUs for various 
fixed “problems” (as defined by the problem inputs). 
Note that both graphs exhibit trends that are more char- 
acteristic of a problem running on a pipelined architec- 
ture rather than a distributed asynchronous algorithm, 
the important factor apparently being the number of 
nodes per processor (figure 6). 


⁸ Transputer is a trademark of INMOS, Inc.

⁹ This is a reflection of the size of the state vector for nodes
in our application, which tends to be very large. We will be
conducting experiments and reporting results on larger systems
in the near future.

Figure 4: EFFICIENCY AS A FUNCTION OF THE NUMBER
OF NODES IN THE SEARCH GRAPH FOR VARIOUS
NUMBERS OF PROCESSING ELEMENTS IN THE SYSTEM.

Figure 6: EFFICIENCY AS A FUNCTION OF THE NUMBER
OF NODES PER PROCESSING ELEMENT.


5 Discussion 


We have presented an overview of an approach to solv- 
ing sequential decision problems in a loosely coupled en- 
vironment. The approach is based on a message passing 
scheme, using three basic message types. We have out- 
lined a proof of the correctness of the algorithm and 
discussed issues related to race conditions that can po- 
tentially occur in the passing of messages. We have 
demonstrated initial results that indicate that the ap- 
proach yields a very high degree of processor utilization 
for problems of a sufficient size. 

Note how efficiency asymptotically approaches 100% 
as the number of nodes/processor increases. Boundary 
conditions existing during the early stages of the gen- 
eration of the search graph (that is, when the first few 
nodes are being generated, and the message and ex- 
pansion queues for all but a few processors are empty) 
dominate when this ratio is small, yielding poor relative 
efficiency, but become less of a factor when the search 
graph is very large. The initialization effect is further ex- 
acerbated by the hashing scheme for distributing nodes 
among the various processors, which tends to provide a 
more uniform distribution of nodes to processors as the 
number of nodes in the search graph becomes large. For 
small problems, even small variances in the number of 
nodes per processor can mean that the expansion and 
message queues of a processor become empty so that 
the processor becomes idle. 


It has often been reported that in many distributed al- 
gorithms implemented on loosely coupled architectures, 
communication overhead becomes the dominant factor 
as the problem size and number of processors grow large. 
This effect depends on (1) the extent to which commu- 
nication between processors introduces a synchronizing 
or “serial” component to the process, (2) the relative 
amount of processor overhead required for transport- 
ing and routing messages, and (3) the extent to which 
communication bottlenecks occur. The first factor is 
not a significant issue in our approach, because no ex- 
plicit synchronization is necessary. Furthermore, except 
for the initial phase of search graph generation, each 
processor maintains a “back-log” of work, so that the 
amount of time a message takes to transit the network 
should not have a significant impact on overall efficiency. 
This explains our conjecture that overall efficiency will 
not be seriously degraded in an implementation on an 
actual loosely coupled system, at least as far as our ap- 
plication is concerned. The effect of the second factor 
can obviously be minimized by an appropriate hardware 
design, which off-loads message routing responsibilities 
from the processors. The last factor is dependent on 
the interconnection topology of the system. 

In determining the significance of the approach
discussed here, it is important to also consider how it
compares to the best, or even commonly available, single
processor algorithms¹⁰. We find that the message passing
scheme proposed here, when run on a single processor,
runs approximately two thirds as fast as the best
single processor scheme that we have developed to date, 
in which costs and bounds are recursively backed up the 
graph. Thus, for example, the distributed algorithm runs 
10 times faster on 15 processors than a “good” sequen- 
tial algorithm running on a single processor. The ratio 
has been observed to be more or less constant, inde- 
pendent of problem size. Furthermore, we have found 
that the ordering of nodes on the expansion queue (pro- 
ducing depth first, breadth first, and by various criteria, 
best first search) has very little effect on the run times 
for the recursive version of the algorithm. 
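
The arithmetic behind the factor of 10 can be written out explicitly. The sketch below uses the definition of relative efficiency from section 4 and assumes a relative efficiency of about 1.0 on 15 processors, which is what the quoted figures imply; the function names are hypothetical.

```python
def relative_efficiency(t_one, t_p, p):
    """E_p = S_p / p, where S_p = t_one / t_p for the same algorithm."""
    return (t_one / t_p) / p


def speedup_over_sequential(efficiency, p, relative_speed):
    """Speedup over the best sequential code.

    relative_speed is the speed of the message passing code on one
    processor divided by the speed of the sequential code (about 2/3 here).
    """
    return efficiency * p * relative_speed


# Assumed efficiency of 1.0 on 15 processors, single-processor ratio 2/3.
factor = speedup_over_sequential(1.0, 15, 2 / 3)
```

With these values the factor is 15 x 2/3, or about 10, matching the figure quoted above. At lower efficiencies the advantage over the sequential code shrinks proportionally.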

By profiling the runs, we have determined that the dif- 
ference is in the extra time required to build and handle 
the messages, which in our application involves a certain 
amount of effort to encode the state vector associated 
with each node, allocate and deallocate message buffer 
space, and so forth. Furthermore, the codes for handling 
messages have been implemented relatively recently and 
have not been optimized to the same extent as the codes 
for the recursive version. The important factor is that
the extra processing time is (1) a constant factor, and
hence does not change the relative order of complexity
of the algorithm, and (2) due to processing local to
each node. It is not, for example, the result of a
serial component or a synchronization effect inherent in
the design of the approach. It would be a penalty
incurred by any approach for distributing the work load on
a loosely coupled system by allocating nodes uniformly
among the processors, since there is no other way in
which the information about the state of an offspring
node can be conveyed to a processor responsible for that
offspring. Although several approaches have been
proposed which maintain subsections of search trees local
to a specific processor [12,26], they would not extend
to processes defined on directed acyclic graphs,
particularly if little is known about the structure of the graph
at the outset, because a given node may have (and in
our case, does have) predecessors from diverse sections
of the graph. Furthermore, approaches that maintain
locality have other disadvantages, including redundant
search, communication, and synchronization overhead.

The reader has no doubt noted the fact that
efficiencies greater than 100% are obtained in some cases. This
effect has also been observed for other types of
parallel search algorithms [6,7,12]. The explanation that has
often been given is that a fortuitous ordering of nodes
on the expansion queue finds an optimal solution
earlier in a multiprocessor environment. This would
probably not be the best explanation of this anomaly in our
case. Among other things, we find that the number of
nodes generated in the search graph tends to be fairly
constant¹¹, independent of the number of processors in
the system. On the other hand, we have observed a
significant decrease (up to 37%) in the number of
re-evaluate bounds messages for a fixed problem size
as the number of processors is increased, and in fact
this decrease, factored by the amount of time required
to generate and handle a re-evaluate bounds message,
explains the difference in elapsed time very well. One
explanation for the reduction in the number of these
messages is that the more parallelism that is employed
in propagating bound constraining information between
a node and a given ancestor, the faster its bounds
become constrained. If the ancestor node is ultimately
to be pruned, it happens sooner, on average, in the
multiprocessor implementation and hence the channel
between that ancestor and any of its predecessors
becomes closed to any further message traffic earlier in
the search graph generation process. We suggest that
further research of this phenomenon is warranted.


¹⁰ To prove that a given algorithm for a complex process is
optimal is a difficult task at best, so let us settle for a
discussion of “best commonly available”.


¹¹ Typically, within 0.3 percent.


Abstract


A fault-tolerant routing scheme for outerplanar networks
is presented, which stores routing information succinctly
and routes messages along near-shortest paths. For an
$n$-node network containing $t$ node and edge faults, the total
space and communication is $O(t\alpha n)$ and the routings
generated are within a factor of $((\alpha+1)/(\alpha-1))^t$ of
optimal, where $\alpha > 1$ is an odd-valued integer parameter.
Thus the routings can be tuned as desired. Efficient
algorithms are given for setting up the routing scheme.


1. Introduction 


A primary function in a distributed network is the
routing of messages between pairs of nodes. Often, a cost
is associated with each edge, making it desirable to route
along shortest, or near-shortest, paths. Although this can
be accomplished easily by storing a complete routing table
at each of the $n$ nodes of the network, such an approach
is expensive, using a total of $O(n^2)$ items of routing
information, where each item is a node name. Thus,
recent research has focused on reducing the amount of
routing information stored, while still retaining good routings.
Compact routing schemes have been designed for
numerous classes of networks, ranging from simple networks such
as trees, rings, complete networks, and complete bipartite
networks [8,9,10] to more complex networks that possess
a certain embedding property (the simplest of which are
the outerplanar networks) [3], and to networks exhibiting
certain separator properties, such as the c-decomposable
networks and planar networks [4]. These schemes examine
the problem in the context of being free to assign suitable
short names to the nodes at the time the network is set
up. The idea here is to encode useful information about
the network within the names and to then use this
information to generate good routings. All the above schemes
use considerably less space than complete routing tables,
keep node names to $O(\log n)$ bits, and still route along
shortest or near-shortest paths.

The problem of compacting routing information has 
implications that go beyond merely saving space in the 
network. The study of this problem has led to several new 
insights into the issues of naming nodes and compactly 
encoding information within node names [3,4], as well as
to fast sequential algorithms for computing all-pairs
shortest paths in planar graphs [2].


Unfortunately, none of the above routing schemes can 
handle node and edge faults, which can invalidate the 
stored routing information and result in arbitrarily bad 
routings. Although the problem can be overcome by re- 
computing the routing information for the resulting net- 
work from scratch, this approach can involve as much as 
O(n^2) communication overhead even for sparse networks
containing a single fault. It is thus desirable to design 
compact routing schemes which can adapt efficiently to 
faults and still route well. 


In this paper we present a space- and communication- 
efficient routing scheme for the class of outerplanar net- 
works. An outerplanar network is a network which can be 
embedded in the plane so that all nodes lie on the bound- 
ary of a single face, usually the exterior face [7]. In our 
approach, the network adapts to faults by distributively 
computing a small amount of additional routing informa- 
tion which is then used in conjunction with compact rout- 
ing information for the original network to restore good 
routings. For any combination of node and edge faults 
that do not disconnect the network, our scheme restores 
near-optimal routings using only a constant amount of ad- 
ditional routing information per fault. Specifically, let t be 
the number of faults and a > 1 an odd-valued integer pa- 
rameter. Then our scheme uses a total of O(tan) items 
of additional routing information and generates routings 
which are, in the worst case, at most ((a + 1)/(a - 1))^t times
longer than optimal. (Note that this bound is less than
(a + t)/(a - t), for a > t > 1.) Thus the routings can be made
as close to optimal as desired by choosing a appropriately 
large. Furthermore, the additional routing information is 
computed efficiently using only O(tan) messages. 

Briefly, our approach is as follows. The worst-case 
occurs when all faults are interior edges, i.e., edges not on 
the exterior face. (As we shall see, optimal routings can be 
reinstated if all the faults are nodes and exterior edges.) 
There are now essentially two candidate paths. To choose 
between them we make use of an interesting monotonicity 
property of distance differences in outerplanar networks. 
Using this property, for each failed edge we suitably par- 
tition the exterior face boundary into a segments. Infor- 
mation about these segments is stored at each node and 
the routing from a source to a destination is performed 
based on the relative positions of the segments containing 
the two nodes. 


A noteworthy feature of our scheme is that if there is 
just one interior edge fault among the t faults, then the
additional information can, in fact, be precomputed at the 
time the network is set up. For each interior edge, informa- 
tion about the a segments is precomputed to handle the 
potential failure of the edge. This information is stored at 
the endpoints of the edge and is broadcast through the net- 
work when the edge fails. We give an efficient sequential 
algorithm to precompute this information for all interior 
edges in O(an log n) time. 

Throughout we model our network by an undirected 
graph. (For graph-theoretic terms not defined here, see 
[7].) In the next section we review the compact routing
scheme for fault-free outerplanar networks presented in [3], 
which is a component of our fault-tolerant scheme. Sec- 
tion 3 describes the fault-tolerant routing scheme. Due to 
space constraints, most proofs are either omitted or ab- 
breviated in this preliminary version. 


2. Interval routing in outerplanar networks 


We first summarize the interval routing method pre- 
sented in [8] for trees and rings. The nodes are named 
appropriately with the integers from 1 to n. For trees, the 
names are depth-first numbers. For rings, the names are 
assigned consecutively, going clockwise around the ring. 
For any vertex v of degree d, let $w_1, w_2, \ldots, w_d$ be the
neighbors of v indexed in clockwise order around the exterior
face starting from v. Each edge incident with v is labeled by an
interval, with the intervals from all edges incident with v
forming a partition of $[1,n] - v$. Wraparound is allowed in the
intervals. For instance, the interval $[i, j)$, $i > j$, contains
$\{i, i+1, \ldots, n, 1, \ldots, j-1\}$. Denote the intervals by
$[l_i, l_{i+1})$, for $i = 1, 2, \ldots, d$, where $l_{d+1} = v$,
and let interval $[l_i, l_{i+1})$ label edge $\{v, w_i\}$. This
interval labeling has the property that $\{v, w_i\}$ is the first
edge on a shortest path from v to any node whose name is in
$[l_i, l_{i+1})$. The values $l_i$, $i = 1, 2, \ldots, d$, are
stored in a table at node v, each with a pointer to the associated
edge $\{v, w_i\}$. When a message arrives at node v, if its
destination u is not equal to v, then the table is searched for
the entry $l_i$ such that $l_i \le u < l_{i+1}$. The message is
then sent out on edge $\{v, w_i\}$. Since the values $l_i$,
$i = 1, 2, \ldots, d+1$, form a rotated list [6,1], the table can
be searched in O(log d) time using a modified binary search.
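
The table search can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [6,1]: the function names `find_rotation` and `lookup_edge` are hypothetical, the interval starts are assumed to be distinct, and the sample table is invented rather than taken from any figure.

```python
def find_rotation(starts):
    """Index of the smallest value in a rotated, strictly increasing list."""
    lo, hi = 0, len(starts) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if starts[mid] > starts[hi]:
            lo = mid + 1      # the minimum lies to the right of mid
        else:
            hi = mid          # the minimum is at mid or to its left
    return lo


def lookup_edge(starts, u):
    """Index i of the cyclic interval [starts[i], starts[i+1]) containing u.

    `starts` holds the interval starts l_1..l_d in table order.
    The destination u is assumed to differ from the node that owns the table.
    """
    d = len(starts)
    r = find_rotation(starts)
    # Binary search on the virtual sorted list starts[(r + k) % d], k = 0..d-1,
    # counting how many interval starts are <= u.
    lo, hi = 0, d
    while lo < hi:
        mid = (lo + hi) // 2
        if starts[(r + mid) % d] <= u:
            lo = mid + 1
        else:
            hi = mid
    # If no start is <= u, u falls in the wraparound part of the interval
    # that begins at the largest start.
    k = (lo - 1) % d
    return (r + k) % d


# Invented table: intervals [7,10), [10,2), [2,4), [4,7) over names 1..12.
table = [7, 10, 2, 4]
print(lookup_edge(table, 8), lookup_edge(table, 1), lookup_edge(table, 5))
```

Both loops halve the search range on every step, so the lookup takes O(log d) comparisons.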

As the following theorem shows, the interval labeling 
method also works for outerplanar networks under a suit- 
able naming of the nodes [3]. The nodes are assigned inte- 
ger names from 1 to n in consecutive order by proceeding 
clockwise around the exterior face; if any node v is visited 
more than once in this traversal, implying that v is an ar- 
ticulation point of the network, then v may be named on 
any one of the visits. We call such a naming of the nodes 
a clockwise node naming. An outerplanar network with a 
clockwise node naming is shown in Figure 1. 


Theorem 1. ([3]) Let G be an n-vertex outerplanar graph
with a clockwise naming of its vertices. For any assignment 
of nonnegative costs to its edges, the end of every edge in- 
cident with any vertex v can be labeled with a subinterval 
of [1,n] such that the edge is the first edge on a shortest 
path from v to any vertex in the subinterval.


Figure 2 illustrates an interval labeling of the edges 
of the network in Figure 1. 

In our fault-tolerant scheme, shortest paths routing 
information for the original (fault-free) outerplanar net- 
work, G, is stored in interval form. Moreover, the interval 
information is set up to favor an edge to a path of two or 
more edges, in the case of ties. Also, we assume that the 
edge costs of G satisfy the generalized triangle inequality, 
i.e., each edge is a shortest path between its endpoints. 


3. Compact fault-tolerant outerplanar routings 


We consider node faults and edge faults separately 
and later show how to handle combinations of node and 
edge faults. As a running example, we will use the unit- 
cost network G shown in Figure 3. 


3.1 Handling node faults 


Consider a single node fault. Let v be the failed node 
and S the set of edges which are on the faces containing v 
but not incident with v. The edges of S form a simple path 
and graph G — v is the union of a number of subgraphs, 
each defined by an edge of S, as follows. Let {a,b} be any 
edge of S such that v is in the interval (a,b). Then the 
subgraph defined by {a, b} is the induced subgraph of G—v 
on the nodes in the interval [b, a]. Call each such subgraph 
an S-component, and the nodes a and b the gateway nodes
of the S-component. Note that the nodes in the subgraph
attached to a are in the interval [a,v) and those in the
subgraph attached to b are in the interval (v,b].


Figure 4 shows the network of Figure 3 after the fail- 
ure of node 5. The edges in S are {4,2}, {2,1}, {1,8}, 
and {8,6}, shown bold. The corresponding S-components 
have nodes in the intervals [2,4], [1,2], [8,1], and [6,8] 
respectively. 


When v fails, each node w in G — v determines the 
following additional routing information: the name v, the 
names of the nodes in each S-component H to which w 
belongs, and the names of the nodes in the subgraphs at- 
tached to each gateway node of H. This is done as fol- 
lows. One of the endpoints of the simple path formed by 
the edges of S, which is a neighbor of v, sends out over 
the path a message containing the name v. Each node on 
the path forwards the message out on the path, and also 
broadcasts its own name and the name v within each of 
the S-components to which it belongs. Thus w receives 
the name v and the names a and b of the gateway nodes
of H. These names v, a, and b succinctly represent the
desired additional routing information for w. Since any 
node is in at most two S-components, O(1) items of addi- 
tional routing information are stored per node, hence O(n) 
overall. The total number of messages exchanged is O(n). 


As an example, node 11 in Figure 4 receives the names 
1 and 8 of the gateway nodes of its S-component, and the 
name 5 of the failed node. From these it deduces that its S- 
component consists of the nodes in the interval [8,1], and 
that the nodes in the subgraphs attached to gateway nodes 
1 and 8 are in the intervals [1,5) and (5, 8] respectively. 


The routing from a source s to a destination d is as 


follows. Let u be any node participating in the routing, 
inclusive of s and d. If u is d then the routing terminates. 
Otherwise, if u and d are in the same S-component, then 
u routes to d using interval routing. Otherwise, u uses 
interval routing to route to that gateway node of its S- 
component such that d is in the subgraph attached to this 
gateway node. 
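
The decision made at each node u can be sketched as follows. This is a minimal illustration, assuming that u already knows the failed node v and the gateway nodes a and b of one S-component containing it; the helper names `in_cyclic` and `next_target` are hypothetical, and the value n = 13 in the usage line is an assumption, since the size of the network in Figure 4 is not stated in the text.

```python
def in_cyclic(x, lo, hi, n):
    """True if x lies in the closed cyclic interval [lo, hi] over names 1..n."""
    return (x - lo) % n <= (hi - lo) % n


def next_target(d, v, a, b, n):
    """Node towards which u forwards a message for d, using interval routing.

    (a, b) are the gateway nodes of u's S-component, with v in (a, b).
    """
    if d == v:
        raise ValueError("destination is the failed node")
    if in_cyclic(d, b, a, n):
        return d          # same S-component: route to d directly
    if in_cyclic(d, a, v, n):
        return a          # d is in [a, v): subgraph attached to gateway a
    return b              # otherwise d is in (v, b]: attached to gateway b


# Node 11 in the example: gateways a = 1, b = 8, failed node v = 5,
# with n = 13 assumed.
print([next_target(d, 5, 1, 8, 13) for d in (9, 3, 7)])
```

A node that belongs to two S-components would apply the same test once per component.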


The routing from u to d will be along a shortest 
path in G —v. Suppose that u and d are in the same 
S-component. Since the edge from S contained in this 
component satisfies the generalized triangle inequality, the 
distance between u and d in G — v equals the distance 
between them in G. Furthermore, since the interval rout- 
ing information favors an edge to a path of two or more 
edges, the routing from u to d will be along a shortest 
path confined to the component. If u and d are in dif- 
ferent S-components, then any shortest (u,d)-path must 
use that gateway node of u’s S-component to which the 
subgraph containing d is attached. The routing strategy 
correctly determines this gateway node and routes to it 
along a shortest path. 


A similar approach works for t > 1 node faults also. Let
$v_1, v_2, \ldots, v_t$ be the failed nodes. Let $S_j$ be the set of
edges on the faces containing $v_j$ but not incident with $v_j$,
$1 \le j \le t$, and let $S = \bigcup_{j=1}^{t} S_j$. Graph
$G' = G - \{v_1, v_2, \ldots, v_t\}$ is the union of a number of
S-components, where an S-component can now contain more than one
edge from S, but at most one from any $S_j$. Each S-component H is
defined by its edges of S as follows:

Suppose that H contains an edge from each of the sets
$S_{i_1}, S_{i_2}, \ldots, S_{i_l}$, $1 \le l \le t$, where the sets
are indexed such that $v_{i_1} < v_{i_2} < \cdots < v_{i_l}$. Let
$\{a_{i_j}, b_{i_j}\}$ be the edge from $S_{i_j}$ in H such that
$v_{i_j}$ is in the interval $(a_{i_j}, b_{i_j})$, $1 \le j \le l$.
Then, H is the induced subgraph of G' on the nodes in
$[b_{i_1}, a_{i_2}] \cup [b_{i_2}, a_{i_3}] \cup \cdots \cup [b_{i_l}, a_{i_1}]$.
The gateway nodes of H are the nodes $a_{i_j}$ and $b_{i_j}$,
$1 \le j \le l$. The nodes in the subgraph attached to any gateway
node u of H are as follows. In general, u will be an endpoint of
more than one edge from $S_{i_1}, S_{i_2}, \ldots, S_{i_l}$, and so
receives labels of both types 'a' and 'b'. For each label $a_{i_j}$
that u receives, $1 \le j \le l$, a subset of the nodes in the
subgraph attached to it are contained in the interval
$[a_{i_j}, v_{i_j})$. Similarly, for each label $b_{i_j}$ that u
receives, a subset of the nodes in the subgraph attached to it are
contained in the interval $(v_{i_j}, b_{i_j}]$. The set of all nodes
in the subgraph attached to u will be the union of these intervals.


As additional routing information, each node w of G’ 
determines the names of the failed nodes, the composition 
of each S-component H containing w, and the composition 
of the subgraphs attached to the various gateway nodes of 
H. A recovery message, containing the name $v_{i_j}$, is sent
out on the simple path defined by $S_{i_j}$, and each node $a_{i_j}$
and $b_{i_j}$ broadcasts its own name and the name $v_{i_j}$ into H,
$1 \le j \le l$. The names $v_{i_j}$, $a_{i_j}$, and $b_{i_j}$
succinctly represent the additional routing information for w. The
total number of message exchanges within H will be O(t|E(H)|).
Thus there are O(tn) message exchanges overall. Each 
node maintains O(t) items of additional routing information
for each S-component of G' to which it belongs. Since
a node can belong to at most as many S-components as 
its degree in G’, the total number of items is O(t|E(G’)|), 
which is O(tn). 

The routing strategy is exactly as before and the rout- 
ings generated will be optimal. 

Figure 5 illustrates the network of Figure 3 after the 
failure of nodes 5 and 13, with the edges of S shown bold. 
Let H be the S-component consisting of nodes 8, 9, 10, 
and 1. H contains edges $\{a_{i_1}, b_{i_1}\} = \{1,8\}$ and
$\{a_{i_2}, b_{i_2}\} = \{10,1\}$ from the sets $S_{i_1}$ and
$S_{i_2}$ corresponding to $v_{i_1} = 5$ and $v_{i_2} = 13$. Thus
each node of H receives the names 1 and 8, and the name of the
associated failed node 5, and also the names 10 and 1, and the name
of the associated failed node 13. From this it deduces that the
nodes of H are in [8,10] U [1,1]. The nodes in the subgraph attached
to 1 are in [1,5) U [13,1), those in the subgraph attached
to 8 in (5,8], and those in the subgraph attached to 10 in
(10,13).
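
The way a node reconstructs the composition of H from the labels it receives can be sketched as follows. This is a minimal illustration; the helper names `cyclic_range` and `component_nodes` are hypothetical, and n = 13 is an assumed network size (the text only shows that node 13 exists).

```python
def cyclic_range(lo, hi, n):
    """Names in the closed cyclic interval [lo, hi] over 1..n, clockwise."""
    count = (hi - lo) % n + 1
    return [(lo - 1 + k) % n + 1 for k in range(count)]


def component_nodes(edges, n):
    """Nodes of an S-component H.

    `edges` is a list of triples (a, b, v), one per set S_{i_j} represented
    in H, where v is the failed node and v lies in the interval (a, b).
    """
    edges = sorted(edges, key=lambda e: e[2])   # order by failed-node name
    nodes = []
    for j, (_, b, _) in enumerate(edges):
        # Each piece runs from b of one edge to a of the next edge, cyclically.
        next_a = edges[(j + 1) % len(edges)][0]
        nodes.extend(cyclic_range(b, next_a, n))
    return nodes


# Labels received in the example: edge {1,8} for failed node 5 and
# edge {10,1} for failed node 13.
print(component_nodes([(10, 1, 13), (1, 8, 5)], 13))
```

With a single triple the function returns the interval [b, a], which is the single-fault case described earlier.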


3.2 Handling edge faults 


An exterior edge fault can be viewed as the failure 
of a fictitious node in the middle of the edge. Thus this 
case can be handled essentially as before, with one of the 
endpoints of the failed exterior edge initiating the recovery. 

Interior edge faults are more difficult to handle, how- 
ever. Unlike node and exterior edge faults where there is 
essentially just one choice of path to route over, there are 
now two candidate paths. This introduces nonoptimality 
in the routings, since the correct choice of path is not al- 
ways apparent. However, for the routing scheme that we 
present, the quality of the routings can be improved to any 
desired extent by using a correspondingly larger amount 
of additional routing information. 

First consider a single interior edge fault. We present 
two approaches. In the first, additional routing informa- 
tion is precomputed for each interior edge at network setup 
time, to handle the potential failure of the edge, and is 
stored at one of the endpoints of the edge. We also give 
an efficient algorithm for precomputing this information. 
We then describe an alternative approach where the addi- 
tional information is computed distributively, as the faults 
occur. We then build upon the second approach to handle 
multiple interior edge faults. 


3.2.1 Handling a single interior edge fault 
Let $e = \{v_1, v_2\}$ be the failed edge and F the boundary
of the face that results from the deletion of e from G.
Graph G - e is the union of a number of subgraphs, each
defined by an edge of F, as follows. Let {a,b} be any edge
of F, with a immediately following b clockwise around F.
Then the subgraph defined by {a,b} is the induced subgraph
of G - e on the nodes in the interval [b,a]. Each
such subgraph is called an F-component and the nodes a
and b are called gateway nodes of the F-component.

Figure 6 shows the network of Figure 3 after interior 
edge {1,8} has failed. The edges {5,1}, {8,5}, {10,8}, 
and {1,10} of F are shown bold. The corresponding
F-components are the subgraphs induced on the nodes in the
intervals [1,5], [5,8], [8,10], and [10, 1] respectively. 


Assume that $v_1$ initiates the recovery. This involves
propagating additional routing information to each node.
This information consists of the name $v_1$, the names of the
gateway nodes of the F-component containing the node,
and information about a points on the boundary of the
exterior face, each at a prescribed distance from $v_1$ in G - e.
The latter is precomputed and stored at $v_1$. Thus a items
are stored for each interior edge, hence O(an) in total.
The total number of message exchanges is O(an).

The information precomputed at $v_1$ is as follows: For
any nodes x and y in G - e, let $x_c$ and $y_c$ be the first and last
nodes of F encountered when going clockwise from x to y
around the exterior face. If no node from F is encountered,
then take $x_c$ and $y_c$ to be x and y respectively. A shortest
clockwise path from x to y consists of a shortest path in
G - e from x to $x_c$, followed by the edges of F going
clockwise from $x_c$ to $y_c$, followed by a shortest path in G - e
from $y_c$ to y. Denote by $\rho_c(x,y)$ the length of this path.
A routing along this path is called a shortest clockwise
routing from x to y. Such a routing is easy to perform,
since the shortest paths from x to $x_c$ and $y_c$ to y can
be realized using the interval routing information for G.
Similarly for a shortest counterclockwise path/routing from
x to y, whose length we denote by $\rho_{cc}(x,y)$.

The following lemma, which is a special case of a more 
general result proved in [2], reveals an interesting mono- 
tonicity property of distance differences in outerplanar net- 
works. 


Lemma 1. View the boundary of the exterior face of the
embedding of G - e as a continuum of points and extend
the distance functions $\rho_c(\cdot,\cdot)$ and $\rho_{cc}(\cdot,\cdot)$ to points. The
function $\rho_c(p, v_1) - \rho_{cc}(p, v_1)$ is monotonically nondecreasing
for points p encountered going counterclockwise from
$v_1$ around the exterior face. ∎


Let C be the total cost of the edges of F. Thus
$\rho_c(\cdot, v_1) - \rho_{cc}(\cdot, v_1)$ changes monotonically from $-C$ to
$C$ in traversing the exterior face boundary counterclockwise
from $v_1$. Let a > 1 be an odd integer parameter.
Precompute and store at $v_1$ the a division points
$v_1 = z_0, z_1, \ldots, z_{a-1}$, lying on the boundary of the
exterior face, where the points are indexed counterclockwise
from $v_1$. Division point $z_i$ satisfies
$\rho_c(z_i, v_1) - \rho_{cc}(z_i, v_1) = -C + (2C/a)i$, $0 \le i \le a-1$.
It is represented as a 3-tuple (i,u,v), where u and v are the
endpoints of the exterior edge {u,v} on which it falls. (If it
coincides with a node then u and v are both the name of this node.)

We illustrate division points with reference to Figure 6,
with a = 3. Division point $z_0$ coincides with node
$v_1 = 1$; $z_1$ lies on edge {9,10}, at a distance of 5/6 from
node 10; and $z_2$ lies on edge {5,6}, at a distance of 1/6
from node 6.
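
The target values of the distance difference at the division points, and the segment index of an arbitrary boundary point, can be computed as sketched here. This is only an illustration: the function names are hypothetical, exact rational arithmetic is used to avoid rounding, and the value C = 4 in the usage line assumes the four unit-cost edges of F in Figure 6.

```python
import math
from fractions import Fraction


def division_targets(C, a):
    """Values -C + (2C/a) * i of the clockwise-minus-counterclockwise distance
    difference that the division points z_0 .. z_{a-1} must attain."""
    if a <= 1 or a % 2 == 0:
        raise ValueError("a must be an odd integer greater than 1")
    return [Fraction(-C) + Fraction(2 * C, a) * i for i in range(a)]


def segment_index(delta, C, a):
    """Index k with -C + (k-1)(2C/a) < delta <= -C + k(2C/a)."""
    return math.ceil((Fraction(delta) + C) * a / (2 * C))


print(division_targets(4, 3))   # target differences for C = 4, a = 3
print(segment_index(0, 4, 3))   # segment of a point whose difference is 0
```

A point whose difference equals a target value exactly is assigned to the segment that ends at that division point, matching the convention used for vertices that coincide with division points.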

The routing from a source s to a destination d is as 
follows. If s and d are both in the same F-component then 
s routes using the interval routing information for G. Otherwise,
s uses the division points to route to d. Let s be
between $z_{(k-1) \bmod a}$ and $z_k$, and d between
$z_{(k'-1) \bmod a}$ and $z_{k'}$, where $0 \le k, k' \le a-1$. In
general, if a vertex coincides with a division point $z_i$, then we
take the vertex to be between $z_{(i-1) \bmod a}$ and $z_i$. If
k < k' (i.e., s precedes d counterclockwise from $v_1$), then if
$k' - k \ge (a+1)/2$, then s performs a shortest clockwise routing
to d; otherwise it performs a shortest counterclockwise routing.
If k' < k (i.e., d precedes s), then if $k - k' \ge (a+1)/2$, then s
performs a shortest counterclockwise routing to d; otherwise
it performs a shortest clockwise routing.
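
The choice of direction can be sketched as a small function. This is an illustration under two assumptions: the threshold is read as "at least (a+1)/2", which is how the proof of Theorem 2 and the worked example for Figure 6 use it, and the case k = k', which the rule does not address, is rejected. The name `choose_direction` is hypothetical.

```python
def choose_direction(k, k_prime, a):
    """Direction of the routing from s (segment k) to d (segment k_prime)."""
    half = (a + 1) // 2          # a is odd, so this is exactly (a + 1) / 2
    if k < k_prime:              # s precedes d counterclockwise from v_1
        return "clockwise" if k_prime - k >= half else "counterclockwise"
    if k_prime < k:              # d precedes s
        return "counterclockwise" if k - k_prime >= half else "clockwise"
    raise ValueError("k == k' is not covered by the rule as stated")


# The two examples worked in the text for Figure 6, with a = 3.
print(choose_direction(1, 2, 3))   # s = 12, d = 7
print(choose_direction(3, 2, 3))   # s = 2,  d = 9
```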


Theorem 2. Let $\rho(s,d)$ be the distance in G - e between
any nodes s and d, and let $\hat{\rho}(s,d)$ be the length
of the routing generated by the above strategy. Then

\[ \hat{\rho}(s,d)/\rho(s,d) \le (a+1)/(a-1). \]


Proof. If s and d are in the same F-component, then the
routing will be along a shortest (s,d)-path. Suppose that s
and d are in different F-components and $\hat{\rho}(s,d) > \rho(s,d)$.
We claim that $\hat{\rho}(s,d) - \rho(s,d) \le C/a$. First we note that
from the positions k and k' of s and d, we have

\[
\begin{aligned}
-C + (k-1)(2C/a) &< \rho_c(s,v_1) - \rho_{cc}(s,v_1) \le -C + k(2C/a), \\
-C + (k'-1)(2C/a) &< \rho_c(d,v_1) - \rho_{cc}(d,v_1) \le -C + k'(2C/a).
\end{aligned}
\]
The proof of the claim involves considering various cases.
Suppose that k < k' (i.e., s precedes d counterclockwise
from $v_1$), and $k' - k \ge (a+1)/2$. Let u be the first
node from F encountered going counterclockwise from s
to d. Note that u exists since s and d are in different
F-components. Clearly, $\rho_{cc}(s,v_1) = \rho_{cc}(s,u) + \rho_{cc}(u,v_1)$,
and $\rho_c(d,v_1) = \rho_c(d,u) + \rho_c(u,v_1)$. Substituting into the
above inequalities gives

\[
\begin{aligned}
-C + (k-1)(2C/a) &< \rho_c(s,v_1) - \rho_{cc}(s,u) - \rho_{cc}(u,v_1) \le -C + k(2C/a), \\
-C + (k'-1)(2C/a) &< \rho_c(d,u) + \rho_c(u,v_1) - \rho_{cc}(d,v_1) \le -C + k'(2C/a).
\end{aligned}
\]

Note that $\rho_{cc}(s,d) = \rho_{cc}(s,u) + \rho_{cc}(u,d) = \rho_{cc}(s,u) + \rho_c(d,u)$,
$\rho_c(s,d) = \rho_c(s,v_1) + \rho_c(v_1,d) = \rho_c(s,v_1) + \rho_{cc}(d,v_1)$,
and $\rho_{cc}(u,v_1) + \rho_c(u,v_1) = C$. By the routing
strategy, if $k' - k \ge (a+1)/2$, then $\hat{\rho}(s,d) = \rho_c(s,d)$.
Thus $\rho(s,d) = \rho_{cc}(s,d)$. Then

\[
\begin{aligned}
\hat{\rho}(s,d) - \rho(s,d) &= \rho_c(s,d) - \rho_{cc}(s,d) \\
&= (\rho_c(s,v_1) + \rho_{cc}(d,v_1)) - (\rho_{cc}(s,u) + \rho_c(d,u)) \\
&\qquad + (C - \rho_{cc}(u,v_1) - \rho_c(u,v_1)) \\
&\le (-C + k(2C/a)) - (-C + (k'-1)(2C/a)) + C \\
&\le C - ((a+1)/2)(2C/a) + (2C/a) \\
&= C/a.
\end{aligned}
\]


Similarly for the case $k' - k < (a+1)/2$ and for k' < k,
which establishes the claim.

Furthermore, it can be shown that whenever $\hat{\rho}(s,d) > \rho(s,d)$,
then $\hat{\rho}(s,d) + \rho(s,d) = \rho_c(s,d) + \rho_{cc}(s,d) \ge C$.
Since $\hat{\rho}(s,d) - \rho(s,d) \le C/a$, it follows that
$\rho(s,d) \ge (C - C/a)/2$. Thus

\[
\begin{aligned}
\hat{\rho}(s,d)/\rho(s,d) &\le (\rho(s,d) + C/a)/\rho(s,d) \\
&\le 1 + (C/a)/((C - C/a)/2) \\
&= (a+1)/(a-1). \qquad ∎
\end{aligned}
\]


From the proof it follows that the routing may fail to
be shortest only when |k' - k| = (a - 1)/2 or |k' - k| =
(a + 1)/2. We illustrate the routing strategy using Figure 6.
Let s be 12 and d be 7; thus k = 1 and k' = 2.
Since s precedes d counterclockwise around the exterior
face from $v_1 = 1$ and k' - k < (a + 1)/2 = 2, a shortest
counterclockwise routing is performed and the message
reaches d on a shortest path in the network, via nodes
10 and 8. As another example, let s be 2 and d be 9;
thus k = 3 and k' = 2. As d precedes s counterclockwise
from $v_1$ and k - k' < (a + 1)/2 = 2, a shortest
clockwise routing is performed. The message reaches 9
on a path of length 4, via nodes 4, 5, and 8. The shortest
path is via nodes 1 and 10, and has length 3. Thus
$\hat{\rho}(2,9)/\rho(2,9) = 4/3 \le (a+1)/(a-1) = 2$.
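
The last step of the proof and the ratio in this example can be checked with exact arithmetic. The sketch below is only a sanity check, not part of the scheme; the function names are hypothetical, and C = 4 assumes the four unit-cost edges of F in Figure 6.

```python
from fractions import Fraction


def stretch_bound(a):
    """Worst-case ratio (a + 1)/(a - 1) of Theorem 2 for a single fault."""
    return Fraction(a + 1, a - 1)


def proof_last_step(C, a):
    """The expression 1 + (C/a) / ((C - C/a)/2) from the end of the proof."""
    extra = Fraction(C, a)               # largest possible excess length
    shortest = (Fraction(C) - extra) / 2  # smallest possible shortest distance
    return 1 + extra / shortest


# With a = 3 the bound is 2, and the example ratio 4/3 lies below it.
print(stretch_bound(3), Fraction(4, 3) <= stretch_bound(3))
print(proof_last_step(4, 3) == stretch_bound(3))
```

Since C cancels in `proof_last_step`, the bound depends only on the parameter a.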


3.2.2. Determining division points efficiently 


The division points for each interior edge can be computed
easily in O(n) time, thus yielding an O(n^2)-time
algorithm for finding the division points for all interior edges.
However, one can do better. We now present an algorithm 
which runs in O(anlogn) time. Our algorithm decom- 
poses G into two smaller graphs, recursively solves the 
problem on these graphs, and then combines the two solu- 
tions into one for G. The decomposition is done by iden- 
tifying a pair of vertices, called separator vertices, whose 
removal disconnects G into two subgraphs with vertex sets 
A and B, each of size at most 2n/3. The separator vertices 
can be found in O(n) time; see [5] for example, where a 
separator algorithm for the more general classes of series- 
parallel and k-outerplanar graphs is given. The division 
points algorithm is as follows. 


Assume that G is biconnected. (Otherwise apply the 
algorithm to each biconnected component.) If G is a cycle 
then there are no division points to compute (as there are 
no interior edges) and the algorithm terminates. Otherwise,
assign weight 1/n to each vertex of G and find separator
vertices x and y of G. Infer the induced subgraphs
$G_1$ and $G_2$ of G on the sets $A \cup \{x,y\}$ and $B \cup \{x,y\}$
respectively. If x and y are not adjacent in G, then include
a path of length two between them in $G_1$ and in $G_2$, with
cost as follows. Let F be the face boundary in G containing
both x and y. Let $F_1$ (resp. $F_2$) be the portion of F
contained in $G_1$ (resp. $G_2$). Then in $G_1$ (resp. $G_2$), assign
cost $\|F_2\|/2$ (resp. $\|F_1\|/2$) to each edge of the path
between x and y. Both $G_1$ and $G_2$ will be biconnected
outerplanar graphs satisfying the generalized triangle
inequality. Also, distances between nodes in $G_1$ (resp. $G_2$)
will be unchanged from G.

Recursively solve the problem on $G_1$ and $G_2$. Complete
the solution for G as follows. Let P be the path
joining x and y in $G_1$ (resp. $G_2$). If x and y are adjacent
in G, then P is the edge e = {x,y}; otherwise it is the
introduced path of length two. Let p be any division point for
interior edge e' of $G_1$ (resp. $G_2$) that falls in the interior
of P. Map p to a point on the portion of the exterior face
boundary of G contained in $G_2$ (resp. $G_1$), such that the latter
point is a division point for e' viewed as an interior edge of G.
Furthermore, if P is e, then compute the division points
for e as well.


We first show how to map each point p originating
in $G_1$ and falling in the interior of P to the boundary
of the exterior face of G contained in $G_2$ (the discussion
for points originating in $G_2$ is similar). For convenience
extend the distance functions $\rho_c(\cdot,\cdot)$ and $\rho_{cc}(\cdot,\cdot)$ to points.
Corresponding to point p on P, there is a point p' on $F_2$
such that $\rho_c(p,v_1) - \rho_{cc}(p,v_1)$ in $G_1$ equals
$\rho_c(p',v_1) - \rho_{cc}(p',v_1)$ in G. In particular, if p is at
distance $l$ from one of the endpoints, say x, of P, then p' is at
distance $l' = l + (\|F_2\| - \|P\|)/2$ from x along $F_2$. Thus, for
each edge of $F_2$, the points p on P that map to points p'
on that edge are consecutive on P. Determine the range of
points on P falling on each edge of $F_2$. Repeat this process
for each interior edge f of $F_2$, determining the ranges of
points p' on f falling on the various edges of $F_2'$, where
$F_2'$ is the remainder of the other face boundary to which
f belongs. From these ranges, infer the ranges of points p
on P falling on the edges of $F_2'$.
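
One step of this mapping can be sketched as follows, taking the offset to be half the difference between the cost of the face portion and the cost of P. The function name `map_point` is hypothetical, and the usage lines assume a layout that the text does not spell out: that in Figure 6 the edges {10,8} and {8,5} each border a triangular face with unit-cost edges, and that the points to be mapped lie 1/3 from node 10 and 2/3 from node 8.

```python
from fractions import Fraction


def map_point(l, cost_p, f2_costs):
    """Map a point at distance l from x on P to the face portion F_2.

    cost_p is the cost of P; f2_costs lists the edge costs of F_2 in order
    from x. Returns (edge index, distance from the start of that edge).
    """
    total = sum(Fraction(c) for c in f2_costs)
    # The shift keeps the clockwise-minus-counterclockwise difference unchanged.
    l_prime = Fraction(l) + (total - Fraction(cost_p)) / 2
    walked = Fraction(0)
    for idx, cost in enumerate(f2_costs):
        if l_prime <= walked + cost:
            return idx, l_prime - walked
        walked += cost
    raise ValueError("point lies beyond the end of F_2")


# Point 1/3 from node 10 on edge {10,8}, mapped onto the path 10-9-8.
print(map_point(Fraction(1, 3), 1, [1, 1]))
# Point 2/3 from node 8 on edge {8,5}, mapped onto the path 8-6-5.
print(map_point(Fraction(2, 3), 1, [1, 1]))
```

Under these assumptions the first point lands 5/6 from node 10 on edge {10,9} and the second 1/6 from node 6 on edge {6,5}, which agrees with the positions quoted for the a = 3 example of Figure 6.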


Proceeding in this fashion yields the range of points p
on P that fall on each exterior edge of G contained in $G_2$.
If p is the i-th division point for interior edge $e' = \{v_1, v_2\}$,
and falls on an exterior edge {u,w}, then encode the
3-tuple (i,u,w) at $v_1$.

If P is e, then compute the division points for e as 
follows. In G - e proceed around F counterclockwise from
one of the endpoints of e, say x, and locate a points on F
at intervals of $\|F\|/a$. For each point determine the edge
on the exterior face on which it falls, using the information 
now available about the ranges of points on each edge of 
F that fall on the various exterior edges of G. This yields 
the division points for e. Encode each division point as a 
3-tuple. 


Theorem 3. Let G be a biconnected n-node outerplanar 
graph satisfying the generalized triangle inequality. The 
above algorithm correctly computes division points for the 
interior edges of G. Furthermore, the running time of the 
algorithm is O(an log n), where a is the number of division 
points computed for each interior edge. 


Proof (sketch). Correctness can be shown by induction
on n, noting that the construction of $G_1$ and $G_2$ preserves
distances and that the mapping of a point falling in the
interior of P to the exterior face leaves the distance difference
invariant. The running time analysis is straightforward. ∎


3.2.3 Computing division points distributively 


The ideas of Section 3.2.2 yield a distributed algo- 
rithm to compute the division points for any given interior 


edge using O(an) messages. When an interior edge fails, 
one of its endpoints initiates the algorithm, determines the 
division points, and then propagates these to all nodes. 
The communication overhead in this approach compares 
favorably with the overhead in the approach where division 
points are precomputed, since in the latter case, O(an) 
messages are needed simply to propagate division points 
to all the nodes. Furthermore, as we will see in the next 
section, with the distributed approach it is possible to ef- 
ficiently handle the failure of more than one interior edge. 


Assume that each node has available the cost
of each incident edge and the cost of each face containing
the node. Let $e = \{v_1, u_1\}$ be the interior edge that fails,
and let F be the boundary of the face resulting from the
deletion of e. First, the nodes on F distributively locate a
points $v_1 = p_0, p_1, \ldots, p_{a-1}$ at intervals of $\|F\|/a$ on F.
Next, the points falling on each edge of F are distributively
mapped to the portion of the exterior face of G' = G - e
contained in the corresponding F-component.


Let $v_1, v_2, \ldots, v_q$ be the nodes of F encountered going
counterclockwise around F from $v_1$. Let $e_i$ denote edge
$\{v_i, v_{(i \bmod q)+1}\}$, and let $L_i$ denote the cost of the segment
of F from $v_1$ counterclockwise to $v_i$, $1 \le i \le q$. Beginning
with $v_1$, each node $v_i$ in turn determines the range
of the points $p_0, p_1, \ldots, p_{a-1}$ falling on $e_i$ as follows. It
determines the largest integer $j_i$ such that
$(\|F\|/a) j_i \le L_i + \|e_i\|$. Thus, points
$p_{j_{i-1}+1}, p_{j_{i-1}+2}, \ldots, p_{j_i}$ fall on $e_i$,
where we take $j_0$ to be $-1$. If $j_i < a-1$, then $v_i$ sends a
message to $v_{i+1}$, with $j_i$, a, $\|F\|$, and $L_{i+1} = L_i + \|e_i\|$.
Upon receipt of this message, $v_{i+1}$ proceeds to determine
the range of points falling on $e_{i+1}$.
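
A sequential simulation of this pass around F is sketched below. It is an illustration only: the loop stands in for the chain of messages, the name `locate_points` is hypothetical, and capping each index at a - 1 (so that the point coinciding with the start node is not counted a second time on the last edge) is an assumption added for the sketch.

```python
def locate_points(edge_costs, a):
    """For each edge of F, in counterclockwise order from the start node,
    list the indices of the a equally spaced points that fall on it."""
    total = sum(edge_costs)          # cost of the whole face boundary F
    ranges = []
    prev_j = -1                      # j_0 = -1
    dist = 0                         # L_1 = 0
    for cost in edge_costs:
        if prev_j >= a - 1:
            ranges.append([])        # all points placed; no message gets here
            continue
        # Largest j with (total / a) * j <= dist + cost, capped at a - 1.
        j = min(a - 1, (a * (dist + cost)) // total)
        ranges.append(list(range(prev_j + 1, j + 1)))
        prev_j = j
        dist += cost                 # L_{i+1} = L_i + cost of e_i
    return ranges


# Four unit-cost edges and a = 3, as for the face F of Figure 6.
print(locate_points([1, 1, 1, 1], 3))
```

A point that falls exactly on a node is assigned to the edge that ends at that node, which follows from the non-strict inequality in the rule.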


Next, the points falling in the interior of each interior
edge $e_i$ are mapped to the exterior edges of G' in the
corresponding F-component. These are the desired division
points. We describe this process for $e_1$, assuming that it
is an interior edge. The mapping is done in a succession
of stages. In the first stage, the points in the interior of
$e_1$ are mapped to the various edges of $F_1$, where $F_1$ is the
remainder of the other face boundary to which $e_1$ belongs.
The stage is initiated by $v_1$, which determines the farthest
point on $e_1$ such that the distance of this point from $v_1$
plus $(\|F_1\| - \|e_1\|)/2$ is at most $\|f_1\|$, where $f_1$ is
the edge from $F_1$ incident with $v_1$. This gives the range
of points from $e_1$ falling on $f_1$. If not all points on $e_1$
fall on $f_1$, then $v_1$ sends a message over $f_1$ to its neighbor,
which proceeds to determine the range of points from $e_1$
falling on $f_2$, where $f_2$ is the other edge of $F_1$ incident with
it. In this manner the ranges of points on $e_1$ falling on the
various edges of $F_1$ are determined. In the next stage this
is repeated for each interior edge of $F_1$. Eventually, all
points on $e_1$ get mapped to exterior edges. Similarly for
points on $e_2, e_3, \ldots, e_q$.


Each division point is encoded in its 3-tuple form by 
one of the endpoints of the exterior edge it falls on. All the 
division points are then routed to $v_1$ (for instance, along
the boundary of the exterior face of G'). 


The total number of messages used to do the map- 
ping is O(n), since at most one message is sent over any 
edge. The number of messages used to route the division
points to $v_1$ is O(an). If the edge costs are integers
and are polynomial in n, then the length of any message 
is O(logn) bits. Similarly, the total additional storage 
needed to store face and edge costs is O(n) items, where 
each item is O(log n) bits long. 


3.2.4 Handling multiple interior edge faults 


We now consider routing efficiently in an outerplanar 
network with t failed interior edges, t > 1. One problem
in trying to generalize the approach of Section 3.2.1, by 
precomputing division points for each edge fault, is that 
the computation for any edge does not take into account 
changes in distances caused by the failure of the other 
edges. Furthermore, precomputing division points for ev- 
ery combination of t failed interior edges is expensive, both 
spacewise and timewise. However, it is possible to generate 
efficient routings if division points are computed distribu- 
tively, as and when edges fail. We now present an approach 
that stores a total of O(tan) items of additional routing 
information in the network and routes with a worst case 
bound of ((a+1)/(a-1))^t, which is less than (a+t)/(a-t),
for a >t > 1. The computation of the additional infor- 
mation uses a total of O(tan) messages. 
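
The relation between the two bounds can be checked numerically. The sketch below is only a sanity check over a small range of parameter values, not a proof; the function names are hypothetical.

```python
from fractions import Fraction


def multi_fault_bound(a, t):
    """Worst-case ratio ((a + 1)/(a - 1))^t for t interior edge faults."""
    return Fraction(a + 1, a - 1) ** t


def simple_upper_bound(a, t):
    """The looser closed form (a + t)/(a - t), defined only for a > t."""
    if a <= t:
        raise ValueError("requires a > t")
    return Fraction(a + t, a - t)


# Compare the two expressions for odd a and 1 < t < a.
for a in (3, 5, 9):
    for t in range(2, a):
        print(a, t, multi_fault_bound(a, t), simple_upper_bound(a, t))
```

For t = 1 the two expressions coincide, which is why the strict inequality is stated for t > 1.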

Let $e_i = \{v_i, u_i\}$, $1 \le i \le t$, be the failed interior
edges. Let $G' = G - \{e_1, e_2, \ldots, e_t\}$. Let $F_i$ be the face
boundary that results when $e_i$ is deleted from $G' \cup e_i$. Then
graph G' is the union of a number of $F_i$-components. Each
node $v_i$ initiates the distributed algorithm of Section 3.2.3
on G' and computes a division points for $e_i$. These are
then propagated to each node in the network. Each node
also receives the name $v_i$ and the names of the gateway
nodes of the $F_i$-component containing the node. The total
number of messages used to compute and propagate
the information is O(an) per failed edge, hence O(tan) in
total. The total space needed to hold this information is
O(tan).

The routing from source s to destination d is as follows.
If s and d are in the same F_i-component for all i,
1 ≤ i ≤ t, then the routing to d is performed using the
interval routing information for G. Otherwise, without loss
of generality, let s and d be in different F_i-components for
i = 1, 2, ..., l, where 1 ≤ l ≤ t. Let g_i be the edge from F_i
such that s is in the F_i-component defined by g_i, 1 ≤ i ≤ l.
Then one of the endpoints of g_i will be in the interval (s, d)
and the other in the interval (d, s). Without loss of
generality assume that when going clockwise around the exterior
face from s to d, the endpoints of g_1, g_2, ..., g_l contained
in interval (s, d) are encountered in order. Thus the g_i
successively 'separate' s from d. Using the division points
for e_1, s performs either a shortest clockwise or a shortest
counterclockwise routing, as described in Section 3.2.1.
Let a_i be the endpoint of h_i reached first, where h_i is the
edge of F_i whose F_i-component contains d, 1 ≤ i ≤ l. For
1 ≤ i ≤ l-1, each a_i uses the division points for e_{i+1} to
perform the routing to d. Once the message reaches a_l,
interval routing information for G is used to route to d.


Theorem 4. Let ρ(s,d) be the distance in G' = G -
{e_1, e_2, ..., e_t} between any nodes s and d, and let ρ̂(s,d)
be the length of the routing generated by the above strategy.
Then ρ̂(s,d)/ρ(s,d) ≤ ((α+1)/(α-1))^t, which is
less than (α+t)/(α-t), for α > t > 1.


Proof (sketch). If s and d are in the same F_i-component
for all i, 1 ≤ i ≤ t, then they are both in the subgraph I
of G' that is the intersection of the F_i-components. It can
be shown that there is a shortest (s,d)-path from G in I.
Thus the interval routing will be along this path.

If s and d are in different F_1-, F_2-, ..., F_l-components,
1 ≤ l ≤ t, then we show by induction on l that
ρ̂(s,d)/ρ(s,d) ≤ ((α+1)/(α-1))^l. The basis, l = 1,
follows from Theorem 2.

For l > 1 we have ρ̂(s,d) = ρ̂(s,a_1) + ρ̂(a_1,d). Now,
ρ̂(a_1,d) is the length of the routing from a_1 to d if G
contains only the l-1 interior edge faults e_2, e_3, ..., e_l.
Thus, by the induction hypothesis,
ρ̂(a_1,d) ≤ ((α+1)/(α-1))^{l-1} ρ(a_1,d). Also, ρ̂(s,a_1) + ρ(a_1,d) is the length of
the routing from s to d if G contains only the fault e_1.
Thus, by the induction hypothesis, ρ̂(s,a_1) + ρ(a_1,d) ≤
((α+1)/(α-1)) ρ(s,d). Substituting these inequalities in
ρ̂(s,d) = ρ̂(s,a_1) + ρ̂(a_1,d) yields
ρ̂(s,d) ≤ ((α+1)/(α-1))^l ρ(s,d) ≤ ((α+1)/(α-1))^t ρ(s,d). An induction on
t shows that ((α+1)/(α-1))^t < (α+t)/(α-t), for
α > t > 1. ∎
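
The closing inequality of the proof can be spot-checked numerically. The sketch below is an illustration only and not part of the proof: it uses exact rational arithmetic to compare the worst-case bound with the simpler closed form over a range of integer parameters. The function names and the tested range are assumptions made for the example.

```python
from fractions import Fraction


def worst_case_bound(alpha: int, t: int) -> Fraction:
    """Stretch bound ((alpha + 1) / (alpha - 1)) ** t as an exact rational."""
    return Fraction(alpha + 1, alpha - 1) ** t


def simple_bound(alpha: int, t: int) -> Fraction:
    """Closed form (alpha + t) / (alpha - t); needs alpha > t."""
    return Fraction(alpha + t, alpha - t)


# Collect every integer pair with alpha > t > 1 (in a small range)
# for which the claimed strict inequality would fail.
violations = [
    (alpha, t)
    for alpha in range(3, 60)
    for t in range(2, alpha)
    if not worst_case_bound(alpha, t) < simple_bound(alpha, t)
]
print("violations:", violations)
```

For t = 1 the two expressions coincide, which is why the strict inequality is stated only for t > 1.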


3.2.5 Handling both node and edge faults 


If all t faults are nodes and exterior edges, then
additional routing information is set up to handle these as
described previously. The routings generated will be along
shortest paths. However, suppose that t' of the t faults
are interior edges. The t - t' node and exterior edge
faults cause the resulting network G' to be the union of
S-components. Each interior edge fault will be confined to
one of the S-components. Additional routing information
is computed first for the node and exterior edge faults,
and then for the interior edge faults, as before. The total
number of items stored and the number of messages
exchanged will both be O((t - t')n + t'αn). The routing from
s to d is done exactly as it is for node and exterior edge
failures, except that the portion of the routing which is
within any S-component of G' that contains one or more
failed interior edges is done using the division points.

In the worst case, all t' interior edge faults are in a single
S-component and all, or nearly all, of the length
of the shortest (s,d)-path is within this S-component.
The routing realized within this S-component is at most
((α+1)/(α-1))^{t'} times longer than a shortest routing
within the component. Thus the length of the (s,d)-routing
is at most ((α+1)/(α-1))^{t'} times longer than a
shortest (s,d)-path.

Figure 3. Outerplanar network G used to illustrate fault-


Abstract


This paper presents an efficient decomposition technique which 
provides a more systematic approach in solving the optimal buffer 
assignment problem of an acyclic data flow graph (ADFG) with a 
large number of computational nodes. The buffer assignment prob- 
lem is formulated as an integer linear optimization problem which 
can be solved in pseudo-polynomial time. However, if the size of an 
ADFG increases, then integer linear constraint equations may grow 
exponentially, making the optimization problem more intractable. 
The decomposition approach utilizes the critical path concept to 
decompose a directed ADFG into a set of connected subgraphs, and 
the integer linear optimization technique can be used to solve the 
buffer assignment problem in each subgraph. Thus, a large-scale 
integer linear optimization problem is divided into a number of
smaller-scale subproblems, each of which can be easily solved in 
pseudo-polynomial time. Examples are given to illustrate the pro- 
posed decomposition technique. 


I. Introduction 


In designing VLSI systolic architectures for many complex 
computational tasks in pattern recognition and signal processing [6], 
and robotics [7],[8], the functional decomposition of the task into a
set of computational modules can be represented as a directed task 
graph, and the inclusion of input data modifies the task graph to an 
acyclic data flow graph (ADFG). The nodes of an ADFG correspond 
to the computational modules, each of which can be realized by a 
linear pipelined functional unit for increasing the system throughput 
[5]. The operands or data move along the edges, each of which connects
a pair of nodes. Because the modules have different computation times,
data flow (both inputs and results from one module to another) in an
ADFG may occur at different speeds in different directions. Thus,
operands may arrive at multi-input modules at different times, causing
an unnecessarily long pipeline time in the ADFG. A conventional
approach is to insert delay buffers
(FIFO queues) at various paths to buffer the inputs or the output 
results from one module to another to achieve a balanced (or syn- 
chronous) ADFG. The problem of balancing a directed ADFG by 
inserting appropriate delay buffers along appropriate paths to 
achieve maximum pipelining has been solved previously by the cut-set
theorem [5],[6], the local correctness criterion g) and the graph-theoretic
approach [2]. Furthermore, Hwang and Xu [4] showed that
the balanced ADFG can be realized in a two-level Seine network 
which is reconfigurable and provides the flexibility in various vector 
processing applications. The delay matching may be handled by 
programmable buffers so that proper non-compute delays can be 
inserted in each data flow path. An example is the design of the 
LINC chip BI, which is an 8-by-8 crossbar with up to 32 units of
programmable delay in each data flow path.


This paper presents an efficient decomposition technique which 
provides a more systematic approach in solving the optimal buffer 
assignment problem of an ADFG with a large number of computa- 
tional nodes. Since it is of vital importance to minimize the number 
of buffers used in a systolic system to minimize the design cost, the 
optimal buffer assignment problem is formulated as an integer linear 
optimization problem, which can be easily solved in computers in 
pseudo-polynomial time [9]. However, if the number of computa- 
tional nodes in an ADFG is quite large, then integer linear constraint 
equations may grow exponentially, making the optimization prob- 
lem more difficult than it should be. The construction of integer 
linear constraint equations in a large-scale ADFG reveals the 
existence of many redundant integer linear constraint equations 
which come from the path overlapping between two paths of two 
different multi-input nodes. They can be easily removed by recognizing
the overlapping path (or common path) traversed by different
paths. In an effort to reduce the difficulty of optimizing a large
number of integer linear constraint equations, an efficient and systematic
decomposition technique is proposed to recognize all the
decomposable subgraphs in an ADFG and generate their associated 
sets of integer linear constraint equations. The decomposition 
approach utilizes the critical path concept to decompose a directed 
ADFG into a set of connected subgraphs, and the integer linear 
optimization technique can be used to solve the buffer assignment 
problem in each subgraph. Thus, a large-scale integer linear optimi- 
zation problem is divided into a number of smaller-scale subprob- 
lems, each of which can be easily solved in pseudo-polynomial time. 
Examples are given to illustrate the decomposition approach and 
finally, the proposed decomposition technique is used to balance an 
interconnection of CORDIC (COordinate Rotation DIgital Com- 
puter [11]) processors to achieve maximum pipelining for computing 
the robot inverse kinematic position solution [8]. 


II. Formulation For Balancing Acyclic Data Flow Graphs 


In formulating the optimal buffer assignment problem, we shall 
assume that the number of computing stages of any computational 
module of an ADFG is finite and that the execution time of any stage 
is a constant, called a basic time unit or stage latency. An ADFG is
maximally pipelined if the minimum number of time units needed for
obtaining two successive outputs from the pipeline is equal to one
basic time unit. We shall concentrate on single-input single-output
(SISO) ADFGs and introduce some necessary definitions for the
formulation.


Definition 1: A weighted ADFG GW = (V, E, W)
corresponding to an ADFG G = (V, E) is a weighted directed graph
where W is a weight function from E to a set of non-negative real
numbers. $V = (v_1, v_2, \ldots, v_N)$ is a finite set of computational nodes
(or modules), and $E = (e_1, e_2, \ldots, e_M)$ is a finite set of edges. An
edge connecting node $v_i$ to node $v_j$ is denoted by $e(i,j)$.


A logical way to convert an ADFG to a corresponding weighted
ADFG is to assign weights to each output edge of a computational
node such that the weight assigned to each edge is equal to the
number of the computing stages of the computational node. For
example, the weight $w(e(i,j))$ assigned to the edge $e(i,j)$ is equal
to the number of computing stages of node $v_i$.


Definition 2: The cost (or weight) of any $k$th path $\phi_k(v_s, v_t)$
from node $v_s$ to node $v_t$ can be defined as the sum of the weights of
all edges along the path. That is,

$$w(\phi_k) = \sum_{e(i,j) \in \phi_k(v_s, v_t)} w(e(i,j)).$$

Thus, the cost of a path from node $v_s$ to node $v_t$ is equal to the
number of computing stages needed for an operand to travel along
the corresponding path from node $v_s$ to node $v_t$.


Definition 3: A weighted ADFG GW with an input node $u_0$ is
said to be balanced if the costs of any two different paths from the
input node $u_0$ to an arbitrary multi-input node $u_k$ are equal.
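
Definitions 1 to 3 can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the four-node graph, its stage counts and the helper names (`edge_weights`, `path_costs`, `is_balanced`) are hypothetical. Each edge is weighted by the stage count of its tail node, all path costs to a node are enumerated, and the graph is balanced when every multi-input node sees a single cost.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def edge_weights(stages: Dict[str, int], edges: List[Edge]) -> Dict[Edge, int]:
    """Weight each edge e(i, j) by the number of stages of its tail node v_i."""
    return {(i, j): stages[i] for (i, j) in edges}


def path_costs(weights: Dict[Edge, int], source: str, target: str) -> List[int]:
    """Costs of all directed paths from source to target (graph assumed acyclic)."""
    succ: Dict[str, List[Tuple[str, int]]] = {}
    for (i, j), w in weights.items():
        succ.setdefault(i, []).append((j, w))
    costs: List[int] = []

    def walk(node: str, cost: int) -> None:
        if node == target:
            costs.append(cost)
            return
        for nxt, w in succ.get(node, []):
            walk(nxt, cost + w)

    walk(source, 0)
    return costs


def is_balanced(weights: Dict[Edge, int], source: str) -> bool:
    """True if all paths from source to each multi-input node have equal cost."""
    indegree: Dict[str, int] = {}
    for (_, j) in weights:
        indegree[j] = indegree.get(j, 0) + 1
    multi_input = [n for n, d in indegree.items() if d > 1]
    return all(len(set(path_costs(weights, source, n))) <= 1 for n in multi_input)


# Hypothetical example: A feeds B and C, which both feed D.
stages = {"A": 2, "B": 3, "C": 1, "D": 1}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
weights = edge_weights(stages, edges)

# Inserting two buffer stages on edge C -> D equalizes the two path costs.
buffered = dict(weights)
buffered[("C", "D")] += 2
```

Path enumeration is exponential in the worst case, so this is suitable only for small illustrative graphs.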


This definition indicates that a balanced ADFG achieves maximum
pipelining. Unfortunately, most ADFGs derived from given tasks
are unbalanced. To balance an ADFG, appropriate delay
buffers must be inserted along appropriate paths from the input
node $u_0$ to any particular multi-input node of interest, so that any
two different paths from the input node $u_0$ to a multi-input node
have equal costs. The appropriate buffering graph in which delay
buffers are inserted to balance an unbalanced ADFG can be defined
as:


Definition 4: A buffering graph GB = (V, E, WB)
corresponding to a weighted ADFG GW = (V, E, W) is a weighted
graph where the weight WB corresponds to the buffering introduced
on E. Then, GB is called a buffering graph of GW. Furthermore, an
ADFG GW' = (V, E, W') can be composed from GW and GB such
that $w'(e(i,j)) = w(e(i,j)) + wb(e(i,j))$, for all $e(i,j) \in E$, where
$wb(e(i,j))$ is the weight of the buffers from node $v_i$ to node $v_j$. If
GW' is a balanced ADFG, then GB is a balanced buffering graph for
GW.

It can be shown that a buffering graph GB for a corresponding
GW always exists, though it may not be unique. In order to minimize
the cost for implementing an ADFG in a VLSI device, it is desirable
to obtain a balanced buffering graph with a minimum number
of delay buffers. Since the cost for any two different paths from the
input node $u_0$ to an arbitrary multi-input node $u_k$ must be equal for
a balanced ADFG, delay buffers can be inserted to balance the cost
for all paths from the input node $u_0$ to a multi-input node $u_k$.
Assume $U = \{u_0, u_1, u_2, \ldots, u_n\}$ is a finite set of all multi-input
nodes and the input and end nodes in GW, and there are $m_k$ paths
from the input node $u_0$ to a multi-input node $u_k$, that is, $\phi_l(u_k)$ (4),
$1 \le l \le m_k$ and $1 \le k \le n$. The critical path $\phi_{l^*}(u_k)$ of a
multi-input node $u_k$ in GW is the path from the input node $u_0$ to the node
$u_k$, $1 \le k \le n$, having the "heaviest" path weight defined as:

$$w^*(u_k) \triangleq w(\phi_{l^*}(u_k)) = \max_{1 \le l \le m_k} \sum_{e(i,j) \in \phi_l(u_k)} w(e(i,j)). \qquad (1)$$

No other path from the input node $u_0$ to the node $u_k$ can have a path
weight greater than the critical path weight $w^*(u_k)$. Thus, the cost
of the critical path from the input node $u_0$ to the end node $u_n$
constitutes the initial delay time of the pipeline. In order to balance an
ADFG, buffers $B(e(i,j))$ are inserted into appropriate
paths $\phi_l(u_k)$, from the input node $u_0$ to a multi-input node $u_k$,
$1 \le k \le n$, so that all paths entering the node $u_k$ have the
same cost. That is,

$$\sum_{e(i,j) \in \phi_l(u_k)} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u_k)} |B(e(i,j))| = w^*(u_k) + \sum_{B(e(i,j)) \in \phi_{l^*}(u_k)} |B(e(i,j))| \qquad (2)$$

where $|B(e(i,j))|$ is the weight or the number of computing stages
in the buffer $B(e(i,j))$, $1 \le l \le m_k$, and $1 \le k \le n$. The first term in
Eq. (2) is a constant and can be easily computed. The problem of
finding all critical paths of $u_k$, $1 \le k \le n$, is known to be solvable by
applying Bellman's equation with time complexity of $O(|N|^2)$ [1],
where N is the number of computational nodes in the GW.
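
The critical path weights of Eq. (1) can be sketched as a longest-path recurrence over a topological order. This is a minimal illustration, not the formulation in [1]: it evaluates the same recurrence edge by edge, so its running time is linear in the number of edges rather than the quadratic bound quoted above. The graph is a fragment of the example that follows. The stage counts of A, B and E are taken from that example's arithmetic, while the value 20 for C is an assumption inferred from $w^*(G) = 25$; the function names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def critical_paths(weights: Dict[Edge, int], source: str):
    """Return (w_star, pred): heaviest path weight from source to each
    reachable node, and the predecessor on that heaviest path."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for (i, j), w in weights.items():
        succ[i].append((j, w))
        indegree[j] += 1
        nodes.update((i, j))

    # Kahn's algorithm gives a topological order of the acyclic graph.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order: List[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m, _ in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)

    w_star = {source: 0}
    pred: Dict[str, str] = {}
    for n in order:
        if n not in w_star:
            continue  # not reachable from the source
        for m, w in succ[n]:
            if w_star[n] + w > w_star.get(m, -1):
                w_star[m] = w_star[n] + w
                pred[m] = n
    return w_star, pred


def path_to(pred: Dict[str, str], source: str, node: str) -> List[str]:
    """Rebuild the critical path from source to node."""
    path = [node]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1]


# Edge weight = stage count of the tail node (A=5, B=6, E=2; C=20 assumed).
fragment = {
    ("A", "B"): 5, ("A", "C"): 5,
    ("B", "E"): 6, ("B", "G"): 6,
    ("C", "G"): 20, ("E", "G"): 2,
}
w_star, pred = critical_paths(fragment, "A")
```

With these values the heaviest path into G is A-C-G, and the two lighter paths A-B-E-G and A-B-G are the ones that would receive buffers.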


Since it is desirable to minimize the initial delay time of the
pipeline so that it is equal to $w^*(u_n)$, no delay buffers $B(e(i,j))$
should be assigned to the critical path $\phi_{l^*}(u_n)$ of the end node $u_n$.

Lemma 1. The critical path $\phi_{l^*}(u_n)$ of the end node $u_n$ is
independent of the buffer stage variables.


Taking this into consideration and rewriting Eq. (2), we have

$$\sum_{B(e(i,j)) \in \phi_l(u_k) \setminus \phi_{l^*}(u_n)} |B(e(i,j))| - \sum_{B(e(i,j)) \in \phi_{l^*}(u_k) \setminus \phi_{l^*}(u_n)} |B(e(i,j))| = \Big[ w^*(u_k) - \sum_{e(i,j) \in \phi_l(u_k)} w(e(i,j)) \Big] = b(l,k) \qquad (3)$$

where $b(l,k)$ is a computed integer constant, $|B(e(i,j))|$ are
undetermined buffer stages, $1 \le l \le m_k$, $1 \le k \le n$, and the notation
$\phi_l(u_k) \setminus \phi_{l^*}(u_n)$ denotes set subtraction and is defined as
$\phi_l(u_k) \setminus \phi_{l^*}(u_n) = \phi_l(u_k) - (\phi_l(u_k) \cap \phi_{l^*}(u_n))$. Equation (3) is a set of
linear simultaneous equations and can be expressed in matrix-vector
form as $Ax = b$, where A is a matrix introduced from the
paths, and x and b are the unknown buffer stage vector and the computed
integer constant vector, respectively. The solution x is usually not
unique; however, we can impose some restrictions on the problem so
that it becomes an integer linear optimization problem. That is, we would
like to minimize the total number of buffer stages in a balanced
buffering graph GB:

$$\text{Min} \sum_{B(e(i,j)) \in GB \setminus \phi_{l^*}(u_n)} |B(e(i,j))| \qquad (4)$$

subject to the equality constraints of Eq. (3) and $|B(e(i,j))| \ge 0$,
integer. The above integer linear programming problem can be
solved in pseudo-polynomial time [9].


In the above buffer assignment problem, the number of buffer
stages is obtained from the solution of the integer linear programming
problem, and the buffers are placed on the edges in the
buffering graph GB corresponding to the GW, except the critical
path $\phi_{l^*}(u_n)$ of the end node $u_n$. In order to reduce the total number
of buffer stage variables in the optimal buffer assignment problem, an
equivalent transformation is performed on a balanced buffering
graph GB to transform it to a normalized buffering graph GB' which
is still a balanced buffering graph (since a balanced buffering graph
is not unique) with respect to the weighted ADFG GW [12]. With
the equivalent transformation on a balanced buffering graph GB,
the optimal buffer assignment problem can be reformulated for the
normalized balanced buffering graph GB' instead of the balanced
buffering graph. This, in effect, greatly reduces the total number of
buffer stage variables because these variables are attached to
multi-output (or multi-input) nodes.


(4) We use the notation $\phi_l(u_i, u_k)$ to indicate an $l$th path from node $u_i$ to
node $u_k$. If node $u_i$ is the input node $u_0$, then $\phi_l(u_0, u_k) = \phi_l(u_k)$.


While constructing the integer linear programming formulation
for the normalized balanced buffering graph for a weighted
ADFG GW, it can be shown that many redundant integer linear
constraint equations (in Eq. (3)) exist, making the optimization
problem more difficult than it should be. The redundant integer
linear constraint equations come from the path overlapping between
two paths of two different multi-input nodes. A path decomposition
technique is utilized to remove redundant integer linear constraint
equations. Let $\phi_l(u_k)$ denote an $l$th path from the input node $u_0$ to a
multi-input node $u_k$ which passes through some other multi-input
nodes. Among these multi-input nodes, a multi-input node $u'$ which
is nearest to the node $u_k$ is selected to decompose the path $\phi_l(u_k)$ into
two sub-paths, that is, $\phi_l(u_k) = \phi_l(u') + \phi_l(u', u_k)$. Thus, the
integer linear constraint equations of the path $\phi_l(u_k)$ with respect to
the node $u_k$ can be written as:

$$\begin{aligned}
\sum_{e(i,j) \in \phi_l(u_k)} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u_k)} |B(e(i,j))|
&= \sum_{e(i,j) \in \phi_l(u')} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u')} |B(e(i,j))| \\
&\quad + \sum_{e(i,j) \in \phi_l(u', u_k)} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u', u_k)} |B(e(i,j))|
\end{aligned} \qquad (5)$$

where $1 \le l \le m_k$. Using Eq. (2) for the path $\phi_l(u')$ to the node $u'$,
the result of Lemma 1, and Eq. (3), Eq. (5) becomes

$$\begin{aligned}
&\sum_{B(e(i,j)) \in \phi_l(u', u_k) \setminus \phi_{l^*}(u_n)} |B(e(i,j))|
+ \sum_{B(e(i,j)) \in \phi_{l^*}(u') \setminus \phi_{l^*}(u_n)} |B(e(i,j))|
- \sum_{B(e(i,j)) \in \phi_{l^*}(u_k) \setminus \phi_{l^*}(u_n)} |B(e(i,j))| \\
&\qquad = \Big[ w^*(u_k) - w^*(u') - \sum_{e(i,j) \in \phi_l(u', u_k)} w(e(i,j)) \Big].
\end{aligned} \qquad (6)$$

With the above procedure for reducing redundant equations, 
the integer linear constraint equations for the normalized balanced 
buffering graph with respect to a weighted ADFG GW can be con- 
structed according to the Procedure ILEG (Integer Linear Equation 
Generator) listed below. 


Procedure ILEG (GW, ILCE(GB')). This procedure generates
a set of integer linear constraint equations ILCE(GB') for a
normalized balanced buffering graph GB' with respect to a given
weighted ADFG GW with labeled nodes.

I1. [Determine all critical paths.] Find all the critical paths
$\phi_{l^*}(u_k)$ and the cost of each critical path $w^*(u_k)$ with respect
to a multi-input node $u_k$, $1 \le k \le n$, by applying Bellman's
equation [1].

I2. [Assign buffer stage variables.] Assign buffer stage variables to
the output edges which are attached to multi-output nodes,
except for the output edges belonging to the critical path of
the end node $u_n$.

I3. [Generate integer linear constraint equations.] For any path
$\phi_l(u_k)$ with respect to a multi-input node
$u_k$, $1 \le l \le m_k$, $1 \le k \le n$, if $\phi_l(u_k)$ does not pass through any
other multi-input nodes, then use Eq. (3) to generate integer
linear constraint equations. Otherwise, use Eq. (6) to generate
integer linear constraint equations where $u' \in \phi_l(u_k)$ is a
multi-input node nearest to the node $u_k$ selected for path
decomposition. Note that the paths and their costs between
two multi-input nodes may be found with time complexity 
O (n°) by using the path-finding algorithm [1]. 


I4. [Output integer linear constraint equations.] Output the
integer linear constraint equations from Eq. (3) or Eq. (6) and
return.

END ILEG


Let us illustrate the above Procedure ILEG by an example. 
Fig. 1(a) shows a weighted ADFG GW. We would like to obtain an 
optimal normalized balanced buffering graph GB’ corresponding to 
the GW. 


Step 1. Nodes G, J, K, and M are multi-input nodes. Then the
critical path for

Node G: $\phi_{l^*}(G)$ = Path A-C-G, $w^*(G) = 25$.
Node J: $\phi_{l^*}(J)$ = Path A-C-G-J, $w^*(J) = 31$.
Node K: $\phi_{l^*}(K)$ = Path A-C-G-K, $w^*(K) = 31$.
Node M: $\phi_{l^*}(M)$ = Path A-C-G-J-M, $w^*(M) = 41$.


Step 2. Obtain the normalized buffering graph as shown in Fig. 1(b).

Step 3. The integer linear constraint equations are generated
according to Eq. (3) or Eq. (6).
(a) For node G:
(i) Path A-B-E-G: $|B_1| + |B_4| = (25 - 5 - 6 - 2) = 12$.
(ii) Path A-B-G: $|B_1| + |B_5| = (25 - 5 - 6) = 14$.
(b) For node J:
(i) Path A-B-F-J: $|B_1| + |B_3| = (31 - 5 - 6 - 12) = 8$.
(c) For node K:
(i) Path A-D-H-K: $|B_2| + |B_6| - |B_8| = 13$.
(ii) Path A-D-I-K: $|B_2| + |B_7| - |B_8| = 9$.
The above integer linear constraint equations have been generated
according to Eq. (3). The following case will show the integer linear
constraint equations generated by using Eq. (6).


(d) For node M, we select $u' = K \in \phi_l(M)$ as the multi-input
node nearest to the node M. The path $\phi_l(M)$ can be decomposed
into two sub-paths, that is, $\phi_l(M) = \phi_l(K) + \phi_l(K, M)$.
According to Eq. (6), we have $|B_8| + |B_9| = (41 - 31 - 8) = 2$.

Step 4. Minimize $\sum_{i=1}^{9} |B_i|$ subject to the constraints of the
integer linear equations generated in Step 3. The optimization
gives $|B_1| = 8$, $|B_2| = 9$, $|B_3| = 0$, $|B_4| = 4$, $|B_5| = 6$,
$|B_6| = 4$, $|B_7| = 0$, $|B_8| = 0$, $|B_9| = 2$, and the total
number of buffer stages is 33.
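
The optimization in Step 4 is small enough to check by exhaustive search. The sketch below is an illustration under assumptions, not the solver used by the authors. The six constraints are one reading of the equations of Step 3, with buffer subscripts reconstructed so that the right-hand sides 12, 14, 8, 13, 9 and 2 and the reported total of 33 are mutually consistent. Treating $B_1$, $B_2$ and $B_8$ as free variables and solving the remaining six from the equalities is a choice made for this example.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Each constraint: ({buffer index: coefficient}, right-hand side).
CONSTRAINTS: List[Tuple[Dict[int, int], int]] = [
    ({1: 1, 4: 1}, 12),         # node G, path A-B-E-G
    ({1: 1, 5: 1}, 14),         # node G, path A-B-G
    ({1: 1, 3: 1}, 8),          # node J, path A-B-F-J
    ({2: 1, 6: 1, 8: -1}, 13),  # node K, path A-D-H-K
    ({2: 1, 7: 1, 8: -1}, 9),   # node K, path A-D-I-K
    ({8: 1, 9: 1}, 2),          # node M, decomposed at K
]


def satisfies(x: Dict[int, int]) -> bool:
    """Check every equality constraint and non-negativity."""
    if min(x.values()) < 0:
        return False
    return all(sum(c * x[i] for i, c in coeffs.items()) == rhs
               for coeffs, rhs in CONSTRAINTS)


def solve() -> Optional[Dict[int, int]]:
    """Enumerate the free variables B1, B2, B8; the equalities fix the rest."""
    best: Optional[Dict[int, int]] = None
    for b1, b2, b8 in product(range(15), range(16), range(3)):
        x = {
            1: b1, 2: b2, 8: b8,
            3: 8 - b1, 4: 12 - b1, 5: 14 - b1,
            6: 13 - b2 + b8, 7: 9 - b2 + b8,
            9: 2 - b8,
        }
        if not satisfies(x):
            continue
        if best is None or sum(x.values()) < sum(best.values()):
            best = x
    return best


best = solve()
total = sum(best.values())
```

Under this reading the objective reduces to $58 - 2|B_1| - |B_2| + 2|B_8|$, so the search pushes $|B_1|$ and $|B_2|$ to their upper limits and keeps $|B_8|$ at zero.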


III. Formulation for Decomposition Approach


The previous section indicates that if the task graph is simple, 
then the buffer assignment problem can be easily solved as illus- 
trated in the above example. However, if the number of computa- 
tional nodes in an ADFG is quite large, then integer linear constraint 
equations may grow tremendously, making the optimization prob- 
lem more intractable. Thus, a systematic approach in reducing the 
computational difficulty in a large-scale integer linear optimization 
for the buffer assignment problem must be devised. A decomposition 
approach, which utilizes the critical path concept to decompose the 
task graph into a set of connected subgraphs from which the integer 


linear optimization technique can be used to solve the buffer assignment
problem in each subgraph, will be addressed in this section.

Lemma 2. If a multi-input node $u_k \in \phi_{l^*}(u_n)$ and its critical
path is $\phi_{l^*}(u_k)$, then $\phi_{l^*}(u_k) \subset \phi_{l^*}(u_n)$. (Lemma 2 can be proved by
contradiction [12].)

Definition 5: Let GW = (V, E, W) be an undirected graph
with N = |V| and M = |E|. A connected component $\pi_m$ of GW is
a maximal connected subgraph, which is a connected subgraph that
is not contained in any larger connected subgraph.

Definition 6: A directed block $\vec{\pi}_m$ of a directed graph $\vec{GW}$
is a directed subgraph whose corresponding undirected subgraph $\pi_m$
(i.e. $\pi_m$ = Undirect($\vec{\pi}_m$) (5)) is a connected component of the
corresponding undirected graph GW (GW = Undirect($\vec{GW}$)).

The problem of finding all the connected components of an
undirected graph GW may be solved with the time complexity of
O(N + M) by using the depth-first search algorithm SEARCH
(GW, $\pi_m$) in [1], where GW is an input undirected graph and $\pi_m$,
$1 \le m \le m_\pi$, are output connected components, where $m_\pi$ is the
number of the directed blocks in the corresponding directed graph
$\vec{GW}$ of GW. The problem of finding the directed blocks $\vec{\pi}_m$ of a
given directed graph $\vec{GW}$ may be solved by a modified depth-first
search algorithm which is described in the Procedure DBS1 (Directed
Blocks Searcher1) listed below.


(5) The notation Undirect($\vec{\pi}_m$) means removing the direction arrows
of $\vec{\pi}_m$.

Procedure DBS1 ($\vec{GW}$, $\vec{\pi}_m$). This procedure finds all the
directed blocks of a given directed graph $\vec{GW}$.

D1. [Obtain the undirected graph of $\vec{GW}$.] Let
GW = Undirect($\vec{GW}$). That is, remove the direction arrows of
$\vec{GW}$.

D2. [Determine undirected connected components of GW.] Find
all the undirected connected components $\pi_m$, $1 \le m \le m_\pi$, of
GW by the depth-first search algorithm SEARCH(GW, $\pi_m$).

D3. [Obtain directed blocks.] Obtain all the directed blocks
$\vec{\pi}_m$, $1 \le m \le m_\pi$, by assigning the direction arrows back to $\pi_m$,
$1 \le m \le m_\pi$, according to the input directed graph $\vec{GW}$.

D4. [Output the directed blocks.] Output all the directed blocks
$\vec{\pi}_m$, $1 \le m \le m_\pi$.

END DBS1
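
The steps of Procedure DBS1 can be sketched as follows. This is a minimal illustration assuming the directed graph is given as a list of (tail, head) pairs; it is not the SEARCH routine of [1], and the function name `directed_blocks` is hypothetical. Arrows are dropped to build an undirected adjacency structure, components are labelled by an iterative depth-first search, and each original directed edge is then returned to the component of its tail.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def directed_blocks(edges: List[Edge]) -> List[List[Edge]]:
    """Return the directed blocks of a directed graph, one edge list per block."""
    # D1: undirected view of the graph.
    adjacent = defaultdict(set)
    for i, j in edges:
        adjacent[i].add(j)
        adjacent[j].add(i)

    # D2: label connected components with an iterative depth-first search.
    component: Dict[str, int] = {}
    next_label = 0
    for start in list(adjacent):
        if start in component:
            continue
        component[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacent[node]:
                if neighbour not in component:
                    component[neighbour] = next_label
                    stack.append(neighbour)
        next_label += 1

    # D3: put the arrows back by grouping the original directed edges.
    blocks: Dict[int, List[Edge]] = defaultdict(list)
    for i, j in edges:
        blocks[component[i]].append((i, j))

    # D4: output the blocks in label order.
    return [blocks[label] for label in sorted(blocks)]


# Hypothetical input with two weakly connected parts.
demo_blocks = directed_blocks([("a", "b"), ("c", "b"), ("x", "y")])
```

Each node and edge is visited a constant number of times, which matches the O(N + M) bound quoted for the depth-first search.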


The connected components $\pi_m$ from the depth-first search
algorithm SEARCH(GW, $\pi_m$) and the directed blocks $\vec{\pi}_m$,
$1 \le m \le m_\pi$, from Procedure DBS1 will be used in our decomposition
approach in obtaining a set of connected subgraphs from which
the integer linear optimization technique can be applied to each
subgraph to solve the buffer assignment problem. Our decomposition
approach utilizes the critical path of the end node $u_n$, i.e. $\phi_{l^*}(u_n)$, as
a cut set to partition an ADFG GW into several subgraphs. The
procedure of graph partition and the determination of decomposed
subgraphs (or directed blocks) is called graph decomposition [10].

The idea of the graph decomposition approach is to first take the
critical path of the given directed graph out. This creates several
edge-disjoint subgraphs with some of the edges not connecting a pair
of nodes because the nodes in the critical path are removed. In order
to remedy this, nodes that are in the critical path $\phi_{l^*}(u_n)$ and are
attached to two or more edges (incoming or outgoing) are called the
decomposed nodes and denoted by $u_k$ (as the $k$th decomposed node);
each of these decomposed nodes $u_k$ will be "split" into several
independent pseudo-nodes $u_k^i$, $1 \le i \le d_k$, which are labeled according
to the attached edges from left to right, and the last pseudo-node
$u_k^{d_k}$ is always assigned to the $k$th decomposed node in the critical
path $\phi_{l^*}(u_n)$, where $d_k$ is the number of independent pseudo-nodes
for the $k$th decomposed node. Thus, a new directed graph $\widehat{GW}$
containing split directed subgraphs of the ADFG GW can be obtained
by removing the critical path $\phi_{l^*}(u_n)$ and "splitting" the decomposed
nodes. That is, $\widehat{GW} = (GW \setminus \phi_{l^*}(u_n)) \cup \{$labeled pseudo-nodes
$u_k^i$, $1 \le i \le (d_k - 1)$, $1 \le k \le k_{pn}\}$, where $k_{pn}$ is the number of
the decomposed nodes in GW. The determination of the directed
blocks $\vec{\pi}_m$ of an ADFG GW when the critical path $\phi_{l^*}(u_n)$ is taken
out is very similar to the Procedure DBS1 for finding the directed
blocks $\hat{\pi}_m$ of $\widehat{GW}$. The directed blocks $\vec{\pi}_m$ and $\hat{\pi}_m$ are always
equivalent except for the existence of the pseudo-nodes
$u_k^i$, $1 \le i \le d_k$. The procedure for determining the directed blocks
$\vec{\pi}_m$ of an ADFG GW when the critical path $\phi_{l^*}(u_n)$ of the end node
$u_n$ is taken out can be described in the following Procedure DBS2
(Directed Blocks Searcher2).


Procedure DBS2 (GW, $\vec{\pi}_m$). This procedure finds all the
directed blocks of GW when its critical path $\phi_{l^*}(u_n)$ is taken out.

S1. [Remove critical path in GW and label decomposed nodes.]
(i) Obtain all the subgraphs from the ADFG GW by removing
the critical path $\phi_{l^*}(u_n)$ of the end node $u_n$ and splitting
the decomposed nodes $u_k$, $1 \le k \le k_{pn}$.
(ii) Label the independent pseudo-nodes of the decomposed
node $u_k$, that is, $u_k^i$, $1 \le i \le d_k$, and
$u_k = u_k^1 \oplus u_k^2 \oplus \cdots \oplus u_k^{d_k}$, where $\oplus$ is the direct sum of
the pseudo-nodes split from the same decomposed node.

S2. [Construct $\widehat{GW}$.] Construct a new directed graph $\widehat{GW}$
which consists of the split directed subgraphs with labeled
pseudo-nodes in step S1:

$$\widehat{GW} = \{ GW \setminus \phi_{l^*}(u_n) \} \cup \Big\{ \bigcup_{k=1}^{k_{pn}} \bigcup_{i=1}^{d_k - 1} \{ u_k^i \} \Big\}.$$

S3. [Find the directed blocks of $\widehat{GW}$.] Use DBS1($\widehat{GW}$, $\hat{\pi}_m$) to
find the directed blocks $\hat{\pi}_m$ of $\widehat{GW}$.

S4. [Identify and merge pseudo-nodes in each directed block.]
Determine the labeled pseudo-nodes which come from the same
decomposed node and are in the same directed block $\hat{\pi}_m$.
These labeled pseudo-nodes will be merged into a big labeled
pseudo-node by the direct sum operator $\oplus$.

S5. [Obtain and output the directed blocks $\vec{\pi}_m$.] Obtain $\vec{\pi}_m$
from $\hat{\pi}_m$ by applying the pseudo-nodes merging procedure in
step S4 and output $\vec{\pi}_m$, $1 \le m \le m_\pi$.

END DBS2
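
The splitting idea behind Procedure DBS2 can be sketched in a simplified form. The code below is an illustration under assumptions rather than the procedure itself: every non-critical edge that touches a critical-path node receives its own pseudo-node copy of that node, connected components are found with a union-find structure in place of the depth-first search of DBS1, and pseudo-nodes are merged by dropping the copy index. The edge list is read off the paths named in the Section II example, and the function name `split_blocks` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]
SplitNode = Tuple[str, Optional[int]]


def split_blocks(edges: List[Edge], critical_path: List[str]) -> List[List[Edge]]:
    """Directed blocks left after removing the critical path and splitting
    every critical-path node into one pseudo-node per attached edge."""
    critical_nodes = set(critical_path)
    critical_edges = set(zip(critical_path, critical_path[1:]))

    # S1/S2: drop critical edges; tag critical endpoints with the edge index
    # so that each attached edge sees its own pseudo-node.
    split_edges: List[Tuple[SplitNode, SplitNode]] = []
    remaining = (e for e in edges if e not in critical_edges)
    for index, (i, j) in enumerate(remaining):
        tail = (i, index if i in critical_nodes else None)
        head = (j, index if j in critical_nodes else None)
        split_edges.append((tail, head))

    # S3: connected components of the split graph via union-find.
    parent: Dict[SplitNode, SplitNode] = {}

    def find(x: SplitNode) -> SplitNode:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for tail, head in split_edges:
        parent[find(tail)] = find(head)

    # S4/S5: merge pseudo-nodes by dropping the copy index, then output.
    blocks: Dict[SplitNode, List[Edge]] = defaultdict(list)
    for tail, head in split_edges:
        blocks[find(tail)].append((tail[0], head[0]))
    return sorted(blocks.values())


EXAMPLE_EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "E"), ("B", "G"), ("B", "F"),
    ("C", "G"), ("E", "G"), ("F", "J"),
    ("D", "H"), ("D", "I"), ("H", "K"), ("I", "K"),
    ("G", "J"), ("G", "K"), ("J", "M"), ("K", "M"),
]
example_blocks = split_blocks(EXAMPLE_EDGES, ["A", "C", "G", "J", "M"])
```

For this edge list the result has two blocks, one built around node B and one built around nodes D and K.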


Using the Procedure DBS2(GW, $\vec{\pi}_m$), we can obtain all the
directed blocks of GW, $\vec{\pi}_m$, $1 \le m \le m_\pi$. Furthermore, new
subgraphs can be constructed from $\vec{\pi}_m$ and defined as
$\tilde{\pi}_m = \vec{\pi}_m \cup_{pn} \phi_{l^*}(u_n)$, for $1 \le m \le m_\pi$, where the operator $\cup_{pn}$
means performing the set union of $\vec{\pi}_m$ and $\phi_{l^*}(u_n)$ (except the
pseudo-nodes) and the direct sum on the pseudo-nodes coming from
the same decomposed nodes in $\vec{\pi}_m$ and $\phi_{l^*}(u_n)$, simultaneously.
These new subgraphs are called pseudo-connected components of the
ADFG GW and will be used to decompose the buffer assignment
problem into several small subproblems.


Let $\pi B_m$ be a normalized balanced buffering graph for $\tilde{\pi}_m$ and
ILCE($\pi B_m$) be the associated set of integer linear constraint
equations which is obtained from the Procedure ILEG. Since an ADFG
GW may have a large number of nodes, determining the buffer stage
variables in GB from its large number of integer linear constraint
equations may not be desirable. Since $GB = \bigcup_{m=1}^{m_\pi} \pi B_m$, with the
union taken under the operator $\cup_{pn}$, we would
like to use this fact to see whether solving the buffer stage variables
in each $\pi B_m$, $1 \le m \le m_\pi$, separately and independently is
equivalent to solving the buffer stage variables in GB. If this is true,
then we have divided a large-scale integer linear optimization
problem into $m_\pi$ smaller-scale subproblems, each of which can be easily
solved.


Theorem 1. Let GB and $\pi B_m$, $1 \le m \le m_\pi$, be, respectively,
the normalized balanced buffering graphs of GW and its
pseudo-connected components $\tilde{\pi}_m$, $1 \le m \le m_\pi$. The buffer stage variables
in GB can be determined from their associated sets of integer linear
constraint equations, ILCE($\pi B_m$), $1 \le m \le m_\pi$, separately and
independently. Furthermore, the buffer stage variables determined
from the set of integer linear constraint equations ILCE($\pi B_{m1}$)
have no relation to the buffer stage variables determined from the
set of equations ILCE($\pi B_{m2}$), where $m1 \ne m2$.
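
The independence claimed by Theorem 1 can be illustrated by grouping constraint equations that share buffer variables. The sketch below is an illustration under assumptions, not part of the proof. The six constraints are one reading of the Section II example with reconstructed buffer subscripts, and the function name `independent_groups` is hypothetical. Two constraints fall in the same group when they share a variable directly or through a chain of other constraints.

```python
from typing import Dict, List, Tuple

Constraint = Tuple[Dict[int, int], int]

# ({buffer index: coefficient}, right-hand side); subscripts are assumed.
EXAMPLE_CONSTRAINTS: List[Constraint] = [
    ({1: 1, 4: 1}, 12),
    ({1: 1, 5: 1}, 14),
    ({1: 1, 3: 1}, 8),
    ({2: 1, 6: 1, 8: -1}, 13),
    ({2: 1, 7: 1, 8: -1}, 9),
    ({8: 1, 9: 1}, 2),
]


def independent_groups(constraints: List[Constraint]) -> List[List[int]]:
    """Partition the variables into sets that never appear together,
    directly or transitively, in a constraint."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Variables in the same constraint belong to the same group.
    for coeffs, _ in constraints:
        variables = list(coeffs)
        for v in variables:
            find(v)
        for v in variables[1:]:
            parent[find(v)] = find(variables[0])

    groups: Dict[int, set] = {}
    for coeffs, _ in constraints:
        root = find(next(iter(coeffs)))
        groups.setdefault(root, set()).update(coeffs)
    return sorted(sorted(g) for g in groups.values())


groups = independent_groups(EXAMPLE_CONSTRAINTS)
```

Under this reading the variables split into two sets, so the two sub-systems can be minimized separately and their optimal totals added.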


Proof: To prove the above theorem, we follow the procedure for 
constructing the associated sets of integer linear constraint 
equations for $GB$ and show how they can be replaced by 
ILCE ($\pi B_m$), $1 \le m \le m_\pi$. For convenience, we assume there is 
a multi-input node $u_i$ in both $GB$ (or the corresponding $GW$) and 
$\pi B_m$ (or the corresponding $\pi_m$), where $\pi B_m$ is the $m$th 
pseudo-connected component of $GB$. Assume that the associated paths 
from the input node $u_0$ to the node $u_i$ in $GB$ (or $GW$) are 
$\phi_l(u_i)$, $1 \le l \le m_i$. Two cases are possible: (1) some of 
these paths pass through $\pi B_m$ only, and (2) some of them pass 
through some other pseudo-connected components of $GB$. In case (1), 
because the paths in $GB$ are also the paths in $\pi B_m$, we obtain 
the same associated sets of integer linear equations for the paths in 
$GB$ and the paths in $\pi B_m$. In case (2), the paths from the input 
node $u_0$ to the node $u_i$ may pass through some other 
pseudo-connected components, but they must intersect the critical path 
$\phi_{l^*}(u_i)$ of the end node $u_i$ at some nodes, and finally end 
at the node $u_i$ in $\pi B_m$. It has been shown previously that a 
multi-input node $u'$, which is on the critical path $\phi_{l^*}(u_i)$ 
and nearest to the node $u_i$, can be selected to decompose the path 
into two subpaths, that is, $\phi_l(u_i) = \phi_l(u') + \phi_l(u', u_i)$, 
where $\phi_l(u')$ is the path from the input node $u_0$ to the node 
$u'$ and passes through some other pseudo-connected components, and 
the entire traversal of the path $\phi_l(u', u_i)$ is in $\pi B_m$. 
Thus, the associated integer linear equation for the path $\phi_l(u_i)$ 
in $GB$ can be rewritten as in Eq. (5). Using Lemma 1 and Eq. (2), the 
first two terms on the right hand side of Eq. (5) can be written as 

$$\sum_{e(i,j) \in \phi_l(u')} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u')} |B(e(i,j))| = w(u') + \sum_{B(e(i,j)) \in \phi_{l^*}(u')} |B(e(i,j))|. \tag{7}$$

Using the result of Lemma 2, the critical path to the node $u'$, 
$\phi_{l^*}(u')$, is the path from the input node $u_0$ to the node $u'$ 
along the critical path $\phi_{l^*}(u_i)$, that is, 
$\phi_{l^*}(u') = \phi_{l^*}(u_0, u')$, which is independent of the 
buffer stage variables. Then Eq. (7) becomes 

$$\sum_{e(i,j) \in \phi_l(u')} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u')} |B(e(i,j))| = w^*(u') = \text{a constant}. \tag{8}$$

Substituting Eq. (8) into Eq. (5), we have: 

$$\sum_{e(i,j) \in \phi_l(u_i)} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u_i)} |B(e(i,j))| = w^*(u') + \sum_{e(i,j) \in \phi_l(u', u_i)} w(e(i,j)) + \sum_{B(e(i,j)) \in \phi_l(u', u_i)} |B(e(i,j))|. \tag{9}$$

Equation (9) indicates two things. First, the associated set of 
integer linear equations with respect to the node $u_i \in \pi B_m$ 
depends only on the buffer stage variables in $\pi B_m$ and is 
independent of the buffer stage variables in the other 
pseudo-connected components, because $\phi_l(u', u_i) \in \pi B_m$. 
Second, Eq. (9) can be generated and replaced by a path in $\pi B_m$, 
that is, the path that travels from the input node $u_0$ to the node 
$u'$ along the critical path $\phi_{l^*}(u_i)$, then from node $u'$ to 
node $u_i$ along the path $\phi_l(u', u_i)$ in $\pi B_m$. So, for any 
multi-input node $u_i$ belonging to $GB$ and $\pi B_m$, it has been 
shown that the associated set of integer linear equations for node 
$u_i$ in $GB$ can be replaced by the associated set of integer linear 
equations for node $u_i$ in $\pi B_m$. In other words, the associated 
set of integer linear equations for $GB$, i.e. ILCE ($GB$), can be 
replaced by the associated sets of integer linear equations for 
$\pi B_m$, i.e. ILCE ($\pi B_m$), $1 \le m \le m_\pi$. $\Box$ 

Using the results from Theorem 1 and based on the fact that 

$$\sum_{B(e(i,j)) \in GB} |B(e(i,j))| \quad \text{becomes} \quad \sum_{m=1}^{m_\pi} \; \sum_{B(e(i,j)) \in \pi B_m} |B(e(i,j))|,$$

the integer linear optimization problem in Eqs. (4) and (3) can be 
rewritten as follows: 

$$\text{Min} \sum_{m=1}^{m_\pi} \; \sum_{B(e(i,j)) \in \pi B_m} |B(e(i,j))| \tag{10}$$

subject to the associated sets of integer linear equation systems 
ILCE ($\pi B_m$), $1 \le m \le m_\pi$. Because the buffer stage variables 
in different pseudo-connected components of $GB$ are independent, Eq. 
(10) can be decomposed into the following subproblems: 


For each $m = 1, 2, \ldots, m_\pi$: 

$$\text{Min} \sum_{B(e(i,j)) \in \pi B_m} |B(e(i,j))|$$

subject to the associated set of integer linear equation system 
ILCE ($\pi B_m$). 


This graph decomposition approach provides us with a technique 
to divide a large-scale integer linear optimization problem into 
$m_\pi$ smaller-scale subproblems, each of which can be 
easily solved in pseudo-polynomial time. Let us apply the above 
decomposition approach to solve the same buffer assignment problem 
as in Section II. 


Step 1. (a) Decompose the ADFG $GW$ in Fig. 1(a) into subgraphs 
by removing the critical path $\phi_{l^*}(M)$ of the end node $M$. 
(b) Label the pseudo-nodes of the decomposed nodes $A$, $G$, $J$, $M$, 
that is, $\{A_1, A_2, A_3\}$, $\{G_1, G_2, G_3, G_4\}$, $\{J_1, J_2\}$, and 
$\{M_1, M_2\}$. 
(c) Construct 
$GW^* = (GW \setminus \phi_{l^*}(M)) \cup \{A_1, A_2, G_1, G_2, G_3, J_1, M_1\}$. 
Note that pseudo-nodes $A_3$, $G_4$, $J_2$, and $M_2$ are attached to 
the critical path $\phi_{l^*}(M)$. $GW^*$, $\phi_{l^*}(M)$, and the labeled 
pseudo-nodes are shown in Fig. 2 (2(a), 2(b), and 2(c)). 
Step 2. This step is the same as the Procedure DBS2 ($GW^*$, $\pi''_m$). 
(a) Use Procedure DBS1 ($GW^*$, $\pi'_m$) to find the directed blocks 
$\pi'_m$, $1 \le m \le 2$, in $GW^*$. These directed blocks are shown 
in Figs. 2(a) and 2(b). 
(b) Merge the labeled pseudo-nodes that come from the same 
decomposed node and are in the same directed block $\pi'_m$ into 
a big labeled pseudo-node by the direct sum operator. For 
example, $G_1$ and $G_2$ are the labeled pseudo-nodes coming 
from the decomposed node $G$ in $\pi'_1$ and will be merged into 
$G_{12} = G_1 \oplus G_2$. 
(c) Obtain $\pi''_m$ from $\pi'_m$, $m = 1, 2$, by applying the 
pseudo-nodes merging procedure. $\pi''_1$ is shown in Fig. 2(d). 


Step 3. Let $\pi_m = \pi''_m \cup \phi_{l^*}(M)$, $1 \le m \le 2$, which are the 
pseudo-connected components of $GW$ (Figs. 2(e) and 2(f)). 


Step 4. The corresponding normalized balanced buffering graphs 
$GB$ and $\pi B_m$ can be easily obtained by the buffer assignment 
rules and have the same graph structure as $GW$ and 
$\pi_m$, respectively. The buffer stage variables $B_1^1, B_3^1, B_4^1, B_5^1$ 
in $\pi B_1$, as shown in Fig. 2(g), and $B_2^2, B_6^2, B_7^2, B_8^2, B_9^2$ in 
$\pi B_2$, as shown in Fig. 2(h), correspond to the buffer stage 
variables $B_1, B_3, B_4, B_5$ and $B_2, B_6, B_7, B_8, B_9$ in $GB$ as 
shown in Fig. 1(b), respectively. 


Step 5. Generate ILCE ($\pi B_1$) and ILCE ($\pi B_2$) as follows: 
ILCE ($\pi B_1$): $|B_1^1| + |B_3^1| = |B_1| + |B_3| = 8$ 
$|B_1^1| + |B_4^1| = |B_1| + |B_4| = 12$ 
$|B_1^1| + |B_5^1| = |B_1| + |B_5| = 14$. 

ILCE (Bz): |B? |+|B? |—|B? |=|B.|+|By|—|Bs]=13 
|B? |+|BF |-|BzZ |=[B.|+[B,|-|Bs|=9 
|B? |+|B2 |= |B, |+|Byl=2. 


Step 6. The integer linear programming problem for $GB$ can be 
solved as two separate subproblems: 

(1) Min $\sum_{B_i \in \pi B_1} |B_i|$ = Min $[\,|B_1| + |B_3| + |B_4| + |B_5|\,]$ 
subject to the ILCE ($\pi B_1$) (found in Step 5). 

(2) Min $\sum_{B_i \in \pi B_2} |B_i|$ = Min $[\,|B_2| + |B_6| + |B_7| + |B_8| + |B_9|\,]$ 
subject to the ILCE ($\pi B_2$) (found in Step 5). 

The optimization of subproblem (1) yields $|B_1| = 8$, $|B_3| = 0$, 
$|B_4| = 4$, $|B_5| = 6$, and the optimization of subproblem (2) 
gives $|B_2| = 9$, $|B_6| = 4$, $|B_7| = 0$, $|B_8| = 0$, $|B_9| = 2$. The 
results and solution are the same as given in the example in 
Section II, but the optimization is much faster and simpler. 
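As a small check of subproblem (1), the sketch below enumerates non-negative integer buffer counts and keeps the assignment with the smallest total. It assumes the three constraints of ILCE ($\pi B_1$) read $|B_1|+|B_3|=8$, $|B_1|+|B_4|=12$ and $|B_1|+|B_5|=14$; the function name `solve_subproblem_1` is hypothetical, and plain enumeration stands in for whatever integer programming solver was actually used.

```python
def solve_subproblem_1():
    """Minimise |B1|+|B3|+|B4|+|B5| subject to the three equations assumed
    for ILCE(piB_1), by enumerating every feasible value of |B1|."""
    best = None
    for b1 in range(0, 15):
        # Each equality fixes one variable once |B1| is chosen.
        b3, b4, b5 = 8 - b1, 12 - b1, 14 - b1
        if min(b3, b4, b5) < 0:
            continue  # buffer counts cannot be negative
        total = b1 + b3 + b4 + b5
        if best is None or total < best[0]:
            best = (total, {"B1": b1, "B3": b3, "B4": b4, "B5": b5})
    return best


total, assignment = solve_subproblem_1()
print(total, assignment)
```

Because the total equals $34 - 2|B_1|$, the minimum is reached at the largest feasible $|B_1|$, which reproduces the values quoted in Step 6.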


The above graph decomposition approach is also applied to a larger 
buffer assignment problem: balancing the CORDIC-based pipelined 
architecture to achieve maximum pipelining for computing the joint 
solution of a PUMA robot manipulator [8]. Using Procedure DBS2 on 
the directed task graph $GW$, 16 directed blocks, $\pi'_m$, 
$1 \le m \le 16$, are obtained. From these directed blocks, we can 
obtain the 16 pseudo-connected components, $\pi_m$, $1 \le m \le 16$. 
The corresponding normalized balanced buffering graph $GB$ for $GW$ 
and the 16 pseudo-connected components in $GB$, $\pi B_m$, 
$1 \le m \le 16$, can be created. The associated sets of integer 
linear equation systems for $\pi B_m$, $1 \le m \le 16$, can be obtained 
from the Procedure ILEG. The optimal solution of all the integer 
linear optimization subproblems yields a total of 159 buffer stages, 
which agrees with the solution given in [8]. 


IV. Conclusion 


An efficient graph decomposition technique which provides a 
systematic approach in solving the optimal buffer assignment prob- 
lem of a large-scale ADFG has been presented and discussed. The 
optimal buffer assignment problem is formulated as an integer linear 
programming problem. The construction of integer linear constraint 
equations in a large-scale ADFG reveals the existence of many 
redundant integer linear constraint equations, making the optimiza- 
tion more intractable. The proposed graph decomposition approach 
utilizes the critical path concept to decompose an ADFG into a set of 
connected subgraphs from which the integer linear optimization 
technique can be used to solve the buffer assignment problem in each 
subgraph. Thus, a large-scale integer linear optimization problem is 
divided into a number of smaller-scale subproblems which can be 
easily solved in computers in pseudo-polynomial time. The proposed 
graph decomposition technique is illustrated by two examples, and 
its efficiency and advantages can be seen in the example for balanc- 
ing a CORDIC pipelined architecture to achieve maximum pipelin- 
ing for computing the robot inverse kinematic position solution. 
Figure 1. An Example for Buffer Assignment Problem (panel (b): $GB$) 


Figure 2. Graph Decomposition of the Example in Figure 1 (panels (g): $\pi B_1$; (h): $\pi B_2$) 


DILATION-2 EMBEDDINGS OF GRIDS INTO HYPERCUBES 


Mee-Yee Chan 
Computer Science Program 
University of Texas at Dallas 
Richardson, Texas 75083-0688 


ABSTRACT 


This paper addresses the following graph-embedding 
question: given a two-dimensional grid, and the smallest 
hypercube with at least as many nodes as grid points, how 
can we assign grid points to hypercube nodes (with at most 
one grid point per node) so as to keep grid-neighbors as near 
each other as possible in the hypercube? We give a simple 
strategy which ensures that grid-neighbors are always 
mapped to hypercube nodes that are within a distance of 
two edges of each other. 


1. INTRODUCTION 


One of the key features of the hypercube is a rich 
interconnection structure which permits important network 
topologies, such as grids and trees, to be efficiently 
simulated. A binary hypercube of dimension $n$, or 
binary $n$-cube, can be thought of as an undirected graph 
of $2^n$ nodes labeled 0 to $2^n - 1$ in binary; two nodes are 
connected by an edge if and only if their labelings differ in 
exactly one bit position. To simulate a grid or a tree on 
the hypercube, nodes of the grid or tree must be mapped to 
hypercube nodes. 


The question of interest here is: how can we map the 
nodes of any two-dimensional grid to the nodes of its 
optimal hypercube (the smallest hypercube with at least 
as many nodes as the grid), on a one-to-one basis, so that 
dilation (the worst case distance between grid-neighbors in 
the hypercube) is kept to a minimum? 


A number of researchers have studied this problem 
[BMS, BS, CC, G, HJ, SS], with the following results. 
Over 61% of all two-dimensional grids can be embedded 
into their optimal hypercubes with a dilation of 1 (i.e. all 
grid-neighbors are also neighbors in the hypercube) by 
using binary-reflected Gray codes [SS]: Figure 1 shows how 
a 6x11 grid can be mapped into its optimal 7-cube. For 
the other over 38% of all two-dimensional grids, which 
have been proven to need at least dilation 2 [BS], we have 
the methods proposed in [BMS], [CC], [G] and [HJ]. 
[BMS], [HJ] and [CC] have shown that a substantial 
percentage of these grids (over 70% of the 38%) can be 
embedded with dilation 2, while [G] claims that all two- 
dimensional grids can be embedded with dilation 5. 


The prevailing sentiment is that all two-dimensional 
grids should be embeddable in their optimal hypercubes 
with at most dilation 2; however, this has yet to be shown. 
This paper introduces a simple embedding strategy which 
does in fact confirm this conjecture. 


2. THE EMBEDDING STRATEGY 


Notation: Let $N_k$ denote the sequence of $k$-bit binary-reflected 
Gray code, and let $N_k(p)$ denote the $(p+1)$st 
element in the sequence $N_k$. For example, 

$N_1 = (0,1)$, 

$N_2 = (00,01,11,10)$, 

$N_3 = (000,001,011,010,110,111,101,100)$ 

and $N_3(4) = 110$. 
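The sequences $N_k$ can be generated by the usual reflect-and-prefix rule. The following is a minimal sketch; the function name `gray_code` and the use of bit strings are choices made here for illustration, not part of the paper.

```python
def gray_code(k):
    """Return N_k, the k-bit binary-reflected Gray code, as a list of bit strings."""
    if k == 0:
        return [""]
    prev = gray_code(k - 1)
    # Prefix the previous sequence with 0, then its reflection with 1.
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]


N3 = gray_code(3)
print(N3)     # the eight 3-bit words listed above
print(N3[4])  # N_3(4), i.e. the fifth element
```

Consecutive words of the sequence differ in exactly one bit, which is the property the embedding relies on.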


Assume all logs are in base 2. Suppose we are given an x x 
y grid G. 


CASE 1. $xy > 2^{\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1}$ or $x = 2^{\lfloor \log x \rfloor}$ or $y = 2^{\lfloor \log y \rfloor}$ 


Then, embed G into its optimal hypercube using the 
binary-reflected Gray code strategy, and hence, with 
dilation 1. 
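The Gray code strategy of Case 1 can be sketched as follows for the 6 x 11 grid mentioned in the introduction. The helper names (`gray`, `gray_embedding`, `dilation`) are hypothetical; the label of a grid point is assumed to be the Gray code of its row index concatenated with the Gray code of its column index.

```python
def gray(p):
    """Return the p-th binary-reflected Gray code word as an integer."""
    return p ^ (p >> 1)


def gray_embedding(x, y):
    """Label grid point (r, c) with Gray(r) followed by Gray(c).

    Returns the cube dimension used and a dict of integer labels.
    """
    row_bits = (x - 1).bit_length()  # ceil(log2 x)
    col_bits = (y - 1).bit_length()  # ceil(log2 y)
    labels = {(r, c): (gray(r) << col_bits) | gray(c)
              for r in range(x) for c in range(y)}
    return row_bits + col_bits, labels


def dilation(x, y, labels):
    """Largest Hamming distance between the labels of two grid-neighbors."""
    worst = 0
    for r in range(x):
        for c in range(y):
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < x and nc < y:
                    diff = labels[(r, c)] ^ labels[(nr, nc)]
                    worst = max(worst, bin(diff).count("1"))
    return worst


dim, labels = gray_embedding(6, 11)
print(dim, dilation(6, 11, labels))
```

For the 6 x 11 grid this uses 3 + 4 = 7 bits, so the grid lands in its optimal 7-cube with every pair of grid-neighbors one edge apart.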

CASE 2. otherwise 

Assume, without loss of — generality, aS = aloes! 


(otherwise, we can rotate G by 90 degrees to assure this). 
Since $xy \le 2^{\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1}$, our objective is to label each 
node of the grid with a unique $(\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1)$-bit 
binary number, which effectively names the node in the 
optimal $(\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1)$-cube to which it is mapped. 
Since we have dilation 2 in mind, we allow the labels for 
grid-neighbors to differ in at most 2 bit positions. 


Step 1. Determine the first $\lfloor \log x \rfloor$ bits of each node’s 
label. 


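Create $2^{\lfloor \log x \rfloor}$ “chains”, each of which is described by a 
$y$-vector of 1’s and 2’s. The vector for the first chain is 

$$(a_{1,1}, a_{1,2}, \ldots, a_{1,y}) = \left( \left\lceil \frac{x}{2^{\lfloor \log x \rfloor}} \right\rceil,\; \left\lceil \frac{2x}{2^{\lfloor \log x \rfloor}} \right\rceil - \left\lceil \frac{x}{2^{\lfloor \log x \rfloor}} \right\rceil,\; \ldots,\; \left\lceil \frac{yx}{2^{\lfloor \log x \rfloor}} \right\rceil - \left\lceil \frac{(y-1)x}{2^{\lfloor \log x \rfloor}} \right\rceil \right).$$

For the $i$th chain, $i = 2, 3, \ldots, 2^{\lfloor \log x \rfloor}$, the vector is 

$$(a_{i,1}, a_{i,2}, \ldots, a_{i,y}) = \left( \left\lfloor \frac{(i-1)x}{2^{\lfloor \log x \rfloor}} \right\rfloor - \left\lfloor \frac{(i-2)x}{2^{\lfloor \log x \rfloor}} \right\rfloor,\; a_{i-1,1},\; a_{i-1,2},\; \ldots,\; a_{i-1,y-1} \right).$$

The chains have the following properties: 

$$\sum_{j=1}^{y} a_{i,j} \le \left\lceil \frac{xy}{2^{\lfloor \log x \rfloor}} \right\rceil \le 2^{\lfloor \log y \rfloor + 1} \quad \text{for } i = 1, 2, \ldots, 2^{\lfloor \log x \rfloor}$$

$$\sum_{i=1}^{2^{\lfloor \log x \rfloor}} a_{i,j} = x \quad \text{for } j = 1, 2, \ldots, y$$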


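as well as 

$$\sum_{i=1}^{k} a_{i,j} \in \left\{ \left\lfloor \frac{kx}{2^{\lfloor \log x \rfloor}} \right\rfloor, \left\lceil \frac{kx}{2^{\lfloor \log x \rfloor}} \right\rceil \right\} \quad \text{for } j = 1, 2, \ldots, y$$

$$\sum_{j=1}^{k} a_{i,j} \in \left\{ \left\lfloor \frac{kx}{2^{\lfloor \log x \rfloor}} \right\rfloor, \left\lceil \frac{kx}{2^{\lfloor \log x \rfloor}} \right\rceil \right\} \quad \text{for } i = 1, 2, \ldots, 2^{\lfloor \log x \rfloor}$$

and, under the assumption on $x$ made above, no consecutive 2’s are 
possible within each chain vector. 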


The idea is that each chain vector represents a “chain”: for 
example, (2,1,2,1,1,2,1,1,2,1,2) represents the chain depicted 
in Figure 2. Hence, for example, an 11 x 11 grid is 
associated with 

(2,1,2,1,1,2,1,1,2,1,2) 

(1,2,1,2,1,1,2,1,1,2,1) 

(1,1,2,1,2,1,1,2,1,1,2) 

(2,1,1,2,1,2,1,1,2,1,1) 

(1,2,1,1,2,1,2,1,1,2,1) 

(1,1,2,1,1,2,1,2,1,1,2) 

(2,1,1,2,1,1,2,1,2,1,1) 

(1,2,1,1,2,1,1,2,1,2,1), 

and pictorially, we have Figure 3. Aligning the nodes of 
the above graph into 11 rows, or in general $x$ rows, we get 
graph $G_1$ as shown in Figure 4. The $2^{\lfloor \log x \rfloor}$ chains will 
cover the x x y grid completely because of the properties 
stated above. 
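The chain vectors can be computed mechanically. The sketch below assumes the first chain is built from ceilings and each later chain is the previous one shifted right by one position with a new leading entry built from floors; this reading is inferred from the example chain (2,1,2,1,1,2,1,1,2,1,2) given for the 11 x 11 grid. The function name `chain_vectors` is hypothetical.

```python
def chain_vectors(x, y):
    """Return the 2^floor(log2 x) chain vectors (lists of 1's and 2's) for an x-by-y grid."""
    m = 1 << (x.bit_length() - 1)  # 2^floor(log2 x)

    def ceil_div(a):
        # Ceiling of a / m using integer arithmetic only.
        return -((-a) // m)

    # First chain: differences of consecutive ceilings of j*x/m.
    first = [ceil_div(j * x) - ceil_div((j - 1) * x) for j in range(1, y + 1)]
    chains = [first]
    for i in range(2, m + 1):
        # New leading entry, followed by the previous chain shifted right by one.
        head = ((i - 1) * x) // m - ((i - 2) * x) // m
        chains.append([head] + chains[-1][:-1])
    return chains


chains = chain_vectors(11, 11)
for chain in chains:
    print(chain)
```

For the 11 x 11 grid every column of the resulting 8 x 11 array sums to 11 and no chain exceeds 16 nodes, which is what lets the chains cover the grid.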


Each node of the x x y grid belonging to the $i$th chain is 
given $N_{\lfloor \log x \rfloor}(i-1)$ as the first $\lfloor \log x \rfloor$ bits of its 
$(\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1)$-bit label. In this way, the first $\lfloor \log x \rfloor$ 
bits of adjacent nodes in the x x y grid differ in at most 
one bit position. Note that, since each chain has length 
$\le 2^{\lfloor \log y \rfloor + 1}$ nodes, $\lfloor \log y \rfloor + 1$ bits are sufficient to 
distinguish the nodes of each chain. 

Step 2. Determine the last $\lfloor \log y \rfloor + 1$ bits of each node’s 
label. 


Firstly, the $j$th node of the $i$th chain is marked with 
$(t_i + j) \bmod 2^{\lfloor \log y \rfloor + 1}$, where $t_1 = -1$ and $t_i = t_{i-1} - a_{i,1} + 1$. 
So, for the 11 x 11 grid, we have the situation shown in 
Figure 5. This marking, because of the nature of the $G_1$ 
graph, has the following property: the markings of adjacent 
nodes of the grid differ by at most 2 (in mod $2^{\lfloor \log y \rfloor + 1}$). 
Figure 6 shows all possible marking scenarios for a node T 
and its grid-neighbor S. For any node T marked with $t$, 
the node S below T is marked $t-1$ if T and S are in the 
same chain, and $t$ otherwise. For any node T marked with 
$t$, the node S left of T is marked $t-1$ or $t-2$. 


Note: If we use $N_{\lfloor \log y \rfloor + 1}(t)$ as the last $\lfloor \log y \rfloor + 1$ bits of 
each node marked with $t$, then we effectively have an 
embedding of G into a $(\lfloor \log x \rfloor + \lfloor \log y \rfloor + 1)$-cube with 
dilation 3, i.e. the labels for grid-neighbors differ in at 
most 3 bit positions. Adjacent nodes in the grid will differ 
in at most 1 position of their first $\lfloor \log x \rfloor$ bits and at most 2 
positions of their last $\lfloor \log y \rfloor + 1$ bits. 


By changing each mark $t$ into $\lfloor t/2 \rfloor$, we have a marking 
with the property that marks for adjacent nodes in the 
grid will differ by at most 1 (in mod $2^{\lfloor \log y \rfloor}$). So, for the 11 
x 11 grid, we have Figure 7. The chains have been 
horizontally extended to have exactly $2^{\lfloor \log y \rfloor + 1}$ nodes each. 
Call such a marked graph $G_2$. 


Our next objective is to color each node of the grid either 
red or black so that 


(a) two nodes marked with the same number belonging to 
the same chain are colored differently, and 


(b) two adjacent nodes marked with different numbers 
belonging to different chains are colored the same. 


Condition (a) ensures that each node of the grid is indeed 
mapped to a unique node in the hypercube, and condition 
(b) ensures that dilation 2 is achieved for adjacent nodes of 
different chains. 


Whether we can do this coloring hinges on whether the 
graph $G_3$, which has as its nodes the nodes of the extended 
graph $G_1$ but has as its edges the set {(S,T) | nodes S and 
T are marked the same and belong to the same chain in 
$G_2$} $\cup$ {(S,T) | there exists a node R such that nodes S and 
R are adjacent but belong to different chains and are 
marked so that S’s mark is one less than R’s, and nodes R 
and T are marked the same and belong to the same chain}, 
is bipartite or not. For the 11 x 11 grid, the graph $G_3$ 
would take the form shown in Figure 8. As it turns out, 
the $G_3$ graph for any grid, in general, can be shown to be 
acyclic, and hence, bipartite and colorable according to our 
objectives. 


With such a coloring, we can do the following. A red node 
marked $t$ is given $0N_{\lfloor \log y \rfloor}(t)$ as the last $\lfloor \log y \rfloor + 1$ bits of 
its label, while a black node marked $t$ is given 
$1N_{\lfloor \log y \rfloor}(t)$ as its last $\lfloor \log y \rfloor + 1$ bits. In this way, adjacent 
nodes of the same chain will differ in at most 2 bit 
positions of their last $\lfloor \log y \rfloor + 1$ bits and share the same 
initial $\lfloor \log x \rfloor$ bits, making for a dilation of 2; adjacent 
nodes of different chains differ in at most 1 bit position of 
their last $\lfloor \log y \rfloor + 1$ bits and 1 bit position of their initial 
$\lfloor \log x \rfloor$ bits, again making for a dilation of 2. Hence, we 
finally do indeed have a dilation 2 embedding! 
Some results on graph coloring in parallel 


Abstract: The problem of constructing parallel 
graph-coloring algorithms is studied. It has been shown 
recently [13, 17, 21] that the problem of Brooks coloring 
of graphs is in NC. In this paper, it is shown that the 
decision version of one of the sequential algorithms for 
coloring graphs, that typically uses fewer colors than the 
Brooks coloring algorithm, is logspace-complete for P; 
therefore it is unlikely that this approach will yield an 
algorithm in NC. An algorithm that colors some graphs 
with fewer colors than Brooks coloring is also shown. 


1 Introduction 


The graph coloring problem is the problem of assigning 
colors to the vertices of a graph in such a way that no two 
adjacent vertices receive the same color. Graph coloring 
has been studied extensively by researchers in the past. 
It is very easy to show that any graph G can be colored 
using no more than $\Delta(G) + 1$ colors, where $\Delta(G)$ is the 
maximum degree of G. But it is also known that usually 
one requires fewer colors. A classic theorem on coloring is 
the theorem of Brooks [4], which shows that if G satisfies 
certain conditions, we can get away with using one less 
color: 


Theorem 1 A simple graph G of maximum degree $\Delta$ 
can be colored using at most $\Delta$ colors iff G is not an odd 
cycle and G does not contain a $K_{\Delta+1}$, i.e. a complete 
subgraph on $\Delta + 1$ vertices. 


Other bounds on the chromatic number are known [6], 
one of which we will use subsequently. 

There has been considerable interest recently in con- 
structing parallel algorithms for graph problems. Of 
particular interest are algorithms that use polynomially 
many processors and take polylog time. We call such 
parallel algorithms good. Problems having good parallel 
algorithms are said to be in the class NC [7]. It is obvious 
that NC $\subseteq$ P, where P is the class of languages 
recognizable by a (sequential) deterministic Turing machine 
in polynomial time. It is not known whether the 
containment is proper. It is conjectured that the hardest 
problems in P are not in NC. 

Many problems exist for which there are simple sequential 
algorithms, but these algorithms seem hard to 
parallelize [1, 16]. The problem of finding a maximal 
independent set in a graph is one such. This problem 
was solved in three very different ways [10, 18, 20]. Another 
example is the problem of coloring graphs using the 
method implied by Brooks’ theorem. Recently, Hajnal 
and Szemeredi [13] exhibited a good parallel algorithm 
to do this. There are also good parallel algorithms for 
five coloring planar graphs [9, 15]. 


*This author’s research is supported in part by NSF grant No. 
NCR 8706350. 

In this paper we look at two different coloring meth- 
ods. The first is a sequential coloring scheme for general 
graphs that does better than Brooks coloring in most 
cases, but we show that it is very unlikely that a good 
parallel algorithm exists for it. We then define a re- 
stricted class of graphs and exhibit a good parallel col- 
oring algorithm for them. 


2 Preliminaries 


Our notation and terminology follow that of Chartrand 
and Lesniak [6]. A (simple) graph G = (V, E) is a set V 
of vertices and a set E of edges, which are unordered 
pairs of vertices. A multigraph is a graph in which multiple 
edges are allowed between the same pair of vertices. 
For a graph G, we denote the degree of a vertex 
v by d(v), and the degree of v in a subgraph H of G by 
$d_H(v)$. We denote the maximum degree of G by $\Delta(G)$, 
the minimum degree of G by $\delta(G)$, and the chromatic 
number of G (i.e. the least number of colors needed to 
color the vertices of G) by $\chi(G)$. 

It was shown by P. Hajnal and E. Szemeredi [13] 
and independently by Howard Karloff [17] and Naor and 
Karchmer [21] that Brooks’ coloring is in NC. But as 
noted above, better colorings exist for most graphs. One 
way to color certain graphs using fewer colors is by using 
what we call the c-index of a graph. 


Definition 1 The c-index of a graph G, denoted $p(G)$, 
is the maximum, over all subgraphs H of G, of the minimum 
degree $\delta(H)$. 

It is known [23] that any graph G can be colored using at 
most $p(G) + 1$ colors. Furthermore, this can be done in 
polynomial time, using a suitable ordering of the vertices. 


Definition 2 A minimum degree elimination sequence 
of a graph G is an ordering $v_1, v_2, \ldots, v_n$ of the vertices 
of G such that, for every $i$, the vertex $v_i$ is of minimum 
degree in the subgraph $G_i$ induced by the vertices 
$\{v_i, v_{i+1}, \ldots, v_n\}$. 


Figure 1: An example minimum degree elimination sequence 
for a graph (a graph on the four vertices $v_1$, $v_2$, $v_3$, $v_4$) 


Figure 1 illustrates a graph with four vertices, for 
which the order $v_1, v_2, v_3, v_4$ is a minimum degree elimination 
sequence. Note that a graph does not necessarily 
have a unique minimum degree elimination sequence; for 
example, $v_3, v_4, v_1, v_2$ is also a minimum degree elimination 
sequence for the graph of Figure 1. 


Theorem 2 If $v_1, \ldots, v_n$ is a minimum degree elimination 
sequence for a graph G then there exists $i$ such that 
$p(G) = \delta(G_i)$. 


Proof Let H be a subgraph of G such that $\delta(H)$ is 
maximum. Given the minimum degree elimination sequence, 
choose the largest $i$ such that H is a subgraph of $G_i$, 
so that $v_i$ is a vertex of H. Then 
$\delta(G_i) = d_{G_i}(v_i) \ge d_H(v_i) \ge \delta(H)$, because H is a 
subgraph of $G_i$ and $v_i$ has minimum degree in $G_i$; and 
$\delta(H) \ge \delta(G_i)$, by maximality. The desired conclusion 
follows. $\Box$ 
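Theorem 2 can be checked on small graphs by comparing the value read off an elimination sequence with a brute-force maximum over subgraphs. The sketch below is illustrative only: the example graph (a triangle with a two-edge tail) and both function names are assumptions, and the brute force looks only at induced subgraphs, which suffices because dropping edges never raises a minimum degree.

```python
from itertools import combinations


def c_index_bruteforce(adj):
    """Maximum, over all non-empty induced subgraphs, of the minimum degree."""
    best = 0
    vertices = list(adj)
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            s = set(subset)
            best = max(best, min(len(adj[v] & s) for v in s))
    return best


def c_index_by_elimination(adj):
    """Largest delta(G_i) seen while repeatedly deleting a minimum degree vertex."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    best = 0
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        best = max(best, len(remaining[v]))
        for u in remaining.pop(v):
            remaining[u].discard(v)
    return best


# Hypothetical example: triangle 1-2-3 with the path 3-4-5 attached.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(c_index_bruteforce(graph), c_index_by_elimination(graph))
```

The brute force is exponential and is only meant as a cross-check; the elimination version is the polynomial-time route the text refers to.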


Theorem 3 For every graph G, the chromatic number 
$\chi(G) \le p(G) + 1$. 


Proof. Let $v_1, \ldots, v_n$ be a minimum degree elimination 
sequence for G. Color the vertices in the order 
$v_n, v_{n-1}, \ldots, v_1$ in a “greedy” fashion, as follows: assign 
to $v_n$ an arbitrary color; for the other vertices, assign 
to $v_i$ a color that has not been assigned to any of its 
already-colored neighbors. It is clear from the definition 
of the minimum degree elimination sequence that when 
$v_i$ is being colored, no more than $p(G)$ neighbors of $v_i$ 
have been colored, and so the theorem follows. 
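The proof of Theorem 3 is constructive and can be sketched directly. In the code below the function names, the tie-breaking rule and the example graph are assumptions made for illustration; colors are represented by the integers 0, 1, 2, and so on.

```python
def elimination_sequence(adj):
    """Return a minimum degree elimination sequence and the largest degree seen."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    order, c_index = [], 0
    while remaining:
        # Ties are broken by vertex label; any tie-breaking rule is valid.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        c_index = max(c_index, len(remaining[v]))
        order.append(v)
        for u in remaining.pop(v):
            remaining[u].discard(v)
    return order, c_index


def greedy_color(adj, order):
    """Color vertices in reverse elimination order with the smallest free color."""
    color = {}
    for v in reversed(order):
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color


# Hypothetical example: triangle 1-2-3 with the path 3-4-5 attached.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
order, c_index = elimination_sequence(graph)
coloring = greedy_color(graph, order)
print(order, c_index, coloring)
```

When a vertex is colored, its already-colored neighbors are exactly its neighbors in the subgraph it was eliminated from, so at most `c_index` colors are blocked and `c_index + 1` colors always suffice.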


Next, we briefly review the notions concerning log- 
space completeness. 


Definition 3 Let A and B be decision problems. Problem 
A is said to be log-space reducible to problem B if 
there exists a function f, computable by a deterministic 
Turing machine in logarithmic space, such 
that for every instance w of problem A, w has an affirmative 
answer if and only if the instance f(w) of B has an 
affirmative answer. 


Definition 4 A decision problem B is log-space complete 
for P if 


• B is in P, and 


• for every A $\in$ P, A is log-space reducible to B. 


Lemma 4 If B is log-space complete for P and B is log-space 
reducible to A $\in$ P, then A is log-space complete for 
P. 


Lemma 5 If B is log-space complete for P and B is in 
NC, then P = NC. 


For a more rigorous treatment of this material see [8, 14]. 
Other related results can be found in [5, 7]. 


3 The construction 


In this section, we show that the problem of determin- 
ing the order of vertices in any minimum degree elimina- 
tion sequence of a graph and the problem of determin- 
ing the c-index p(G) of a graph are complete in P with 
respect to logspace reducibility. Our method is to trans- 
form the circuit value problem restricted to fanout two 
(CVP2), which is known to be log-space complete for P 
[11, 12, 19, 22], to these two problems. Howard Karloff 
pointed out recently that the results also follow using 
NC reductions from some results of Anderson and Mayr 
[2, 3]. 
Let us define formally the two problems of interest. 


Problem $\Pi$: 


Instance: A graph G and two vertices $u$ and $v \in V$, with 
the property that $u$ appears before $v$ in some minimum 
degree elimination sequence iff it appears before 
$v$ in all minimum degree elimination sequences. 


Question: Does $u$ appear before $v$ in some minimum 
degree elimination sequence? 


Problem CVP2: 


Instance: A Boolean circuit represented as a sequence 
$B = (f_1, f_2, \ldots, f_l, B_1, \ldots, B_n)$ with the $f_i$’s representing 
inputs with value false and the $B_i$’s representing 
NOR gates. Each NOR gate has fanout at 
most 2. The circuit has no feedback, i.e. each gate 
$B_i$ is of the form $\neg(B_j, B_k)$ where $j, k < i$. Each of 
the $f_i$’s is an input to exactly one $B_j$. The truth 
value of B is defined as the output truth value of 
$B_n$. 


Question: Is the value of B false ? 
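To make the CVP2 instance concrete, here is a minimal evaluator for such a circuit. The encoding is an assumption made for illustration: all values live in one list, with the inputs (all false) first, and each gate is given as a pair of indices of earlier entries. The function name `eval_nor_circuit` is hypothetical, and the sketch does not check the fanout restriction.

```python
def eval_nor_circuit(num_inputs, gates):
    """Evaluate a feedback-free circuit of two-input NOR gates.

    num_inputs: number of circuit inputs, all of which carry the value False.
    gates: list of (a, b) index pairs; each index refers to an earlier
           input or gate in the combined sequence.
    Returns the output value of the last gate.
    """
    values = [False] * num_inputs
    for a, b in gates:
        if a >= len(values) or b >= len(values):
            raise ValueError("gate refers to a later entry (feedback)")
        values.append(not (values[a] or values[b]))
    return values[-1]


# Four false inputs feed two NOR gates, whose outputs feed a third NOR gate.
result = eval_nor_circuit(4, [(0, 1), (2, 3), (4, 5)])
print(result)
```

In this example the first two gates output true, so the last gate outputs false and the CVP2 question is answered affirmatively.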


Theorem 6 CVP2 is complete in P with respect to logspace 
reducibility [12, 22]. 


We transform CVP2 to the minimum degree elimination 
sequence problem. Let B be an instance of CVP2. 
We construct a graph G = (V, E) and pick out two vertices 
$u$ and $v$ from V such that $u$ follows $v$ in a minimum 
degree elimination sequence of G iff $B_n$ is true. We construct 
a multigraph of maximum multiplicity 4 first, and 
then show how to convert this to a simple graph satisfying 
the same predicates. The construction is made up of 
several “components.” (The term ‘component’ here does 
not mean ‘connected component’; it is used to denote a 
subgraph corresponding to a gate or input.) Each input 
$f_i$ has a component, each gate $B_i$ has 2 components, and 
there is a garbage collecting component. 

Notation. An edge $(u, v)$ of multiplicity $k$ is represented 
as $(u, v)_k$. 


Every node of the multigraph is a tuple. The first 
element of the tuple determines the type of component 
the node is in, the second element determines which gate 
or input value the node represents, and the remaining 


elements of the tuple identify particular nodes that make 
up the component. 


For each input $f$ we have a node (INP, i). The constant 
INP at the first position tells us that this node corresponds 
to an input in B, and the second value $i$ tells 
us the gate $B_i$ to which it is input. The nodes in the 
multigraph corresponding to the inputs $f$ of the circuit 
form the set $V_{inp}$. 


Next, consider a gate $B_i$ of the circuit, with inputs 
from gates $B_j$ and $B_k$, and outputs to gates $B_s$ and $B_t$. 
Corresponding to $B_i$ there are 2 components (see Figure 2): 


(1) The “true” component, consisting of the subgraph 
$T_i = (V_{t_i}, E_{t_i})$, where 


$V_{t_i}$ = {(T,i,IN), (T,i,OUT,s), (T,i,OUT,t)}, 
$E_{t_i}$ = {((T,i,IN), (T,i,OUT,s))$_2$, 
((T,i,IN), (T,i,OUT,t))$_2$}. 


The element T in the first position indicates that 
this node is in the true part, and the element $i$ in 
the second position indicates that this component 
corresponds to the gate $B_i$. The constants IN and 
OUT at the third place signify that these nodes 
refer to the inputs and the outputs of the gates. 
We will see later on that the OUT nodes of $B_i$ 
will be connected to the IN nodes of $B_s$ in some 
fashion. 


(2) The “false” component, consisting of the subgraph 
$F_i = (V_{f_i}, E_{f_i})$, where 


$V_{f_i}$ = {(F,i,IN), (F,i,G), 
(F,i,OUT,s), (F,i,OUT,t)} 


$E_{f_i}$ = {((F,i,IN), (F,i,G))$_3$, 
((F,i,G), (F,i,OUT,s)), 
((F,i,G), (F,i,OUT,t))}. 

The part of the construction that depends on which 
gates have what inputs and outputs is the collection of 
communication edges. The true component of each gate 
has edges to the false components of its inputs and outputs. 
More precisely, suppose that the gate $B_i$ has outputs 
connected to the inputs of gates $B_s$ and $B_t$ (there 
can be at most two, since we are considering the modified 
circuit value problem). Then the communication 
edges between the subgraphs representing $B_i$, $B_s$ and $B_t$ 
are: 

$E_{c_i}$ = {((T,i,OUT,s), (F,s,IN)), 
((T,i,OUT,t), (F,t,IN)), 
((F,i,OUT,s), (T,s,IN))$_4$, 
((F,i,OUT,t), (T,t,IN))$_4$}. 
Figure 2: the subgraph corresponding to gate $B_i$ (with the 
neighboring nodes (T,j,OUT,i), (F,j,OUT,i), (T,k,OUT,i), (F,k,OUT,i)) 


The idea is to force the nodes to get into the minimum 
degree elimination sequence in an order that models 
the flow of truth values through the Boolean circuit. 
We will force the node (T,i,IN) to get into the minimum 
degree elimination sequence iff the output value of gate 
$B_i$ is true. We will see that nodes with degree 3 or 4 get 
picked to be put into the minimum degree elimination 
sequence at each stage, and we will force the ordering by 
requiring that when one node (say (T,i,IN)) is picked, the 
other node ((F,i,IN)) must have degree 5 or more. 


To ensure that a node does not go into the minimum 
degree elimination sequence prematurely, we will need a 
garbage collection component, which we call GC. Corresponding 
to every node $n$ in $T_i$ and $F_i$ whose degree 
needs adjusting there is a node $n'$ in GC which is connected 
by some number of edges to its image $n$. The 
number of edges varies from 1 to 4 depending on the 
requirement. In addition, each node in GC is connected to 
every other node in GC. This ensures that nodes in GC 
succeed all other nodes in any minimum degree elimination 
sequence. To understand the use of GC, notice 
that for any $i$, (T,i,OUT,j) has degree 3, but we want to 
ensure that (T,i,OUT,j) gets into the minimum degree 
elimination sequence only after (T,i,IN), so we need to 
increase its degree to at least 5. This we do by connecting 
it to its image in GC. This business of connecting arcs 
emanating from a node to its image in GC is referred to 
as garbage collection. A node in the garbage collecting 
component is identified by gc in the first place. 


$E_{g_i}$ = {((T,i,OUT,s), (gc,T,i,OUT,s))$_3$, 
((T,i,OUT,t), (gc,T,i,OUT,t))$_3$, 
((F,i,G), (gc,F,i,G))$_2$}. 


There are just two more things we need to consider. 


1. When an input to some $B_i$ is false, then the communication 
links between the inputs to the circuit and 
the input gates to the circuit are as follows 


Fes, = {({INP,¢), Ck; t, IN) )a, ((F, t, IN), (ge, 58; IN) )2} 


The nodes corresponding to inputs, i.e. nodes in 
Vinp, do not have a true part and hence the inputs 
are garbage collected. 


2. If a gate has one or zero outputs then the arcs 
emanating from the out nodes are again garbage 
collected. 


Now that we have described the connections, let us 
look at the intuitive ideas behind the constructions. 

The elimination sequence for the false part should 
look like (F,i,IN), (F,i,G), (F,i,OUT,s) and (F,i,OUT,t). 
We note that when a node gets into the minimum 
degree elimination sequence the degree of its successor 
becomes 4 or less. Similarly the elimination sequence 
for the true part looks like (T,i,IN), (T,i,OUT,s) and 
(T,i,OUT,t). 


We need (T,i,IN) to get into the minimum degree 
elimination sequence before (F,i,IN) iff the output of $B_i$ 
is true. We see that if this does happen then (T,i,OUT,j) 
gets into the minimum degree elimination sequence before 
(F,i,OUT,j). We note the following two things: 


1. d((F,i,IN)) becomes less than 5 iff one of (T,j,OUT,i) 
and (T,k,OUT,i) gets picked into the minimum 
degree elimination sequence, that is, iff at least 
one of $B_j$ and $B_k$ is true. 


2. d((T,i,IN)) becomes less than 5 iff both (F,j,OUT,i) 
and (F,k,OUT,i) get picked into the minimum degree 
elimination sequence, that is, iff both $B_j$ and $B_k$ 
are false. These are the nodes joined to (T,i,IN) 
by the communication edges defined above. 


So we see the correspondence between the function of a 
NOR gate and the interconnection pattern. 

The graph is the union of all the edges and vertices 
mentioned above: 


G = (V,E) 
v= Vinp UVi; UVs, U Veco 
E = 


Ei, U Ey, U Ee; U Ey, U Ep, U Eeo 


A more formal proof of correctness of this construction 
is presented in [1]. 

We next show how to convert the multigraph into a 
simple graph without altering any of the previous results. 


1. (a,b)2 is replaced by adding 6 other nodes $c_1, c_2, 
c_3, d_1, d_2, d_3$ and new edges. We first form a $K_6$ on 
the new nodes minus the edges $(c_1,c_3)$ and $(d_1,d_3)$, and 
then connect a to $c_1$ and $c_3$ and b to $d_1$ and $d_3$. We 
note that for any of the new nodes the degree becomes 
less than 5 iff one of a or b is removed. It is also 
seen that if one of a or b is removed, the degree of 
the other can be decreased by 2 without increasing 
p(G) to 5. 


2. (a,b)4 is converted to multiplicity 2 by adding 2 
other nodes c and d and connecting using arcs of 
multiplicity 2, both a and b, to c and d and also 
connecting c and d. 


3. (a,b)3 is converted by a similar method, except that 
the multiplicity of (a,c) and (b,d) is now one. 

We now show that the following problem is complete in 
P with respect to logspace reducibility. 


Problem Π: 
Instance: A graph G. 
Question: Is p(G) ≤ 4? 


The problem with the previous construction is that 
p was determined by the GC. We see that $B_n$ is true 
iff (T,n,IN) gets into the minimum degree elimination 
sequence before (F,n,IN). In fact, before (F,n,IN) gets 
into the minimum degree elimination sequence we come 
across a subgraph with δ = 5. We utilize this fact and 
change GC. For each arc going to a node in GC from 
the basic graph we will have a separate node. Call this 
set of nodes $N_1$ (the subscript will be apparent in a 
moment). All nodes in $N_1$ are connected to at least 1 
and at most 2 other nodes in $N_1$ in an arbitrary fashion. 
Let the number of nodes in $N_1$ be l. Then 
$V_{gc} = N_1 \cup N_2 \cup \cdots \cup N_{\lceil \log l \rceil + 2}$. 
We will call the subscripts on N levels. 
$N_2$ also contains l nodes, the images of the nodes in $N_1$, 
and these are connected by arcs of multiplicity 4 to their 
images. Level 2 up to level $\lceil \log l \rceil + 2$ looks like 
a binary tree, with the nodes at level 2 forming the leaves, 
except that 


• if a node has only one child then the multiplicity 
of that arc is 3. 

• all other arcs have multiplicity 2. 

• every level has at most one node having one child. 


Figure 3 shows levels 1 and above of a garbage collect- 
ing component when the number of nodes at level 1 is 
five. Note that there are 5 levels. One other change is 
required. The arcs emanating from (T,n,OUT,*) are 
not garbage collected but connected to the root, i.e. the 
node at the topmost level in GC. 


Figure 3: An example garbage collecting component 


Theorem 7 The output of B is true iff p = 4. 


Proof: If B is true then (T,n,IN) gets into the MDES 
and then so does (T,n,OUT,*). Now the root in GC has 
degree 4 and hence can get into the MDES. It is easy to 
see that for a node at level i, if its parent has been 
removed then its degree is less than 5. So all of GC can 
be removed keeping p = 4, and the rest of the graph follows 
suit. If on the other hand B were false, then once 
(F,n,OUT,*) is put in the minimum degree elimination 
sequence the resultant graph has δ = 5. 


4 An NC coloring algorithm 


So we see that improving the bound of Brooks' theorem 
may not be easy to do in NC. However, there is hope in 
the case of graphs where there is disparity in the degree 
of nodes in the graph. The following discussion illus- 
trates our point. We consider the case when a node with 
large degree is connected to a large number of nodes of 
small degree. 


Theorem 8 Let $q(G)$ be the minimum number such that 
a graph G does not have a $K_{q(G)+1}$ and each vertex v 
with degree $\geq q(G)$ has at most $q(G)$ neighbours that 
have degree $\geq q(G)$. Then G can be colored in NC 
time using at most $q(G)$ colors. 


Proof: The idea is to color the nodes with degree $\geq q(G)$ 
first. The other nodes can then be colored without increasing 
the number of colors, since each of them is adjacent to at 
most $q(G) - 1$ nodes. More formally, let 
$I = \{v : d(v) \geq q(G)\}$. Consider the graph $G_I$ induced 
by I. Since each node in I is adjacent to at most $q(G)$ nodes 
of degree $\geq q(G)$, the maximum degree of a node in $G_I$ is 
$q(G)$. So $G_I$ can be colored using $q(G)$ colors in NC time 
using Hajnal and Szemeredi's algorithm. The rest of the graph 
can now be colored using the procedure extend mentioned in 
Luby's paper [20]. For details see [1]. 


SUBGRAPH ISOMORPHISM FOR CONNECTED GRAPHS OF BOUNDED 
VALENCE AND BOUNDED SEPARATOR IS IN NC 

581 83 Linköping, Sweden 


Abstract: We present a parallel algorithm for sub- 
graph isomorphism restricted to a connected graph H of 
bounded valence and a connected graph G of bounded va- 
lence and bounded 0-1 weighted separator, i.e. a "1/3 - 
2/3" separator for any assignment of 0-1 weights to vertices 
of G. Our algorithm runs in time $O(\log^3 n)$ using a polyno- 
mial number of processors, i.e. it is an $NC^3$ algorithm. 


1. Introduction 


The subgraph isomorphism problem is to determine 
whether a graph can be imbedded in another graph, i.e. 
whether the former is isomorphic to a subgraph of the lat- 
ter. It is a fundamental graph problem with a variety of ap- 
plications in engineering sciences, organic chemistry, and 
pattern recognition. For instance, if H is an n-vertex circuit 
and G is an n-vertex planar graph of valence 3, n ∈ N, then 
determining whether H can be imbedded in G is equiva- 
lent to the NP-complete problem of determining whether 
a planar graph of valence 3 has a Hamiltonian circuit [3]. 
Thus, the subgraph isomorphism problem is NP-complete 
even if G and H range only over connected planar graphs 
of valence ≤ 3. Subgraph isomorphism also remains NP- 
complete when the first input graph is a forest and the 
other input graph is a tree (see p. 105 in [3]). 

The only known polynomial-time algorithms for subgraph 
isomorphism are those for trees [14,15], two-connected out- 
erplanar graphs [8], and two-connected series-parallel 
graphs [12]. Recently, it has been also shown that the 
problem of subgraph isomorphism for two-connected out- 
erplanar graphs is in the class NC [11] (i.e. can be solved 
in poly-log time using polynomial number of processors 
[2]), and that the subgraph isomorphism problems for trees 
and two-connected series-parallel graphs respectively are 
in the random class NC (see [4,10] and [12] respectively). 
Are there other non-trivial classes of graphs for which sub- 
graph isomorphism is in NC or at least can be solved in 
polynomial time? 

In [7], it was shown that subgraph isomorphism restricted 
to connected graphs of bounded valence and bounded sep- 
arator (the so called "1/3 - 2/3" separator) can be solved 
sequentially in time $n^{O(\log n)}$. In [9], the above result has 
been strengthened by showing that the above problem can 
be solved in parallel in poly-log time using $n^{O(\log n)}$ pro- 
cessors. 

In this paper, we strengthen these results from [7,9] by 
presenting an $NC^3$ algorithm for subgraph isomorphism 
restricted to a connected graph H of bounded valence and 
a connected graph G of bounded valence and bounded 0-1 
weighted separator, i.e. a "1/3 - 2/3" separator for any 
assignment of 0-1 weights to vertices of G. The algorithm is 
based on a non-trivial double use of the weighted separator, 
similar to that for subgraph homeomorphism in [11,16]. 
Seymour and Robertson have recently shown that any 
proper sub-family of planar graphs closed under the minor 
operation, as well as any family of graphs of bounded tree 
width have bounded 0-1 weighted separator [16,17]. Thus, 
our result applies to the case where G is any graph in the 
above families restricted to connected graphs of bounded 
valence. In particular G can be a connected series-parallel 
graph (see [6]) of bounded valence, or more generally, a 
partial k-tree. As for H, it should be a connected graph of 
bounded valence. 

The remainder of the paper is divided into two sec- 
tions. In Section 2 we introduce basic notions, definitions, 
and facts used in the succeeding sections. In Section 3 we 
present the algorithm and analyze its time complexity. 


2. Preliminaries 


We shall use standard set and graph theoretic notation 
and definitions (for instance, see [1,3]). Specifically, we 
assume the following set and graph conventions: 


1) Given a partial mapping π of T into U, dom(π) denotes 
the set of all elements of T on which π is defined. Next, 
given a subset T' of T, π(T') denotes 

{π(e) | e ∈ dom(π) ∩ T'}. 

2) For a graph G, V(G) denotes its set of vertices. 

3) Given a subset V' of V(G), G(V') denotes the sub- 
graph of G induced by V'. 

4) Given graphs $G_i$, i = 1,...,k, $\bigcup_{i=1}^{k} G_i$ (or, 
equivalently, $G_1 \cup ... \cup G_k$) denotes the graph G where 

$V(G) = \bigcup_{i=1}^{k} V(G_i)$ and two vertices of G are 
adjacent if and only if there is i, 1 ≤ i ≤ k, such that the 
vertices are adjacent in $G_i$. 


For the definitions of the classes $NC^k$, NC, the reader is 
referred to [2]. 

We shall consider the following restriction of the subgraph 
isomorphism problem. 


Definition 2.1: Let H, G be two graphs, and let π be a 
partial one-to-one mapping of V(H) into V(G). The π- 
imbedding problem for H and G is to decide whether there 
exists an isomorphism between H and a subgraph of G 
that is an extension of π. Such an isomorphism is called a 
π-imbedding of H in G. 
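
To make Definition 2.1 concrete, the following Python sketch decides the π-imbedding problem by exhaustive search. It is only an illustration under stated assumptions: graphs are adjacency dictionaries (vertex to set of neighbours), π is a Python dict assumed to be one-to-one, and the name `pi_imbedding` is hypothetical. It takes exponential time and is not the algorithm of this paper.

```python
from itertools import permutations

def pi_imbedding(H_adj, G_adj, pi):
    """Return a one-to-one map extending `pi` that sends every edge of H
    to an edge of G (an isomorphism onto a subgraph of G), or None."""
    used = set(pi.values())
    h_free = [v for v in H_adj if v not in pi]    # vertices still to be mapped
    g_free = [v for v in G_adj if v not in used]  # images still available
    for image in permutations(g_free, len(h_free)):
        phi = dict(pi)
        phi.update(zip(h_free, image))
        # phi is an imbedding iff every edge (v, w) of H lands on an edge of G
        if all(phi[w] in G_adj[phi[v]] for v in H_adj for w in H_adj[v]):
            return phi
    return None
```

With an empty `pi` this is plain subgraph isomorphism, as noted in the text.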


Note that the problem of subgraph isomorphism can be 
expressed as the π-imbedding problem where π is an empty 
mapping. We use dynamic programming to solve the π- 
imbedding problem for graphs H, G, where the cardinality 
of the domain of π is bounded and H, G are of bounded 
valence and bounded number of connected components, 
and G is of bounded weighted separator. 

For technical reasons, we define the concept of bounded 
weighted separator of a graph with 0-1 vertex weights 
(see [13]) through that of an m-separation of a graph. It is 
left to the reader to verify that our definition is equivalent 
to a standard one. 


Definition 2.2: Let m be a positive integer. Let G be a 
graph and let $W_1, W_2, ..., W_k$ be subsets of V(G). Finally, 
let W be a subset of V(G) of cardinality not greater than 
m. The sequence $(W_1, W_2, ..., W_k, W)$ is an m-separation of 
a graph G if the removal of W from G disconnects G into 
connected components $C_i$, i = 1,...,k, where $V(C_i) = W_i$. 
A graph G is said to have an m-separator if for any subset 
U of V(G) there is an m-separation $(W_1, W_2, ..., W_k, W)$ 
of G such that none of the intersections $W_i \cap U$, 
i = 1,...,k, has more than (2/3)|U| vertices. 


Remark 2.1: If G has an m-separator then any subgraph 
of G has also an m-separator. 

Hint: By the assignment of 0 weights to the vertices of G 
outside the subgraph one can extract the subgraph from 
G. 


The correctness of our NC divide-and-conquer algorithm 
will in part follow from the following technical lemma. Its 
proof, being easy but lengthy, is left to the reader. 


Lemma 2.1: Let H, G be graphs with m-separator. Next, 
let π be a partial mapping of V(H) into V(G). There 
is a π-imbedding of H in G if and only if for any m- 
separation $(W_1,...,W_k, W)$ of G, there are an m-separation 
$(V_1,...,V_l, V)$ of H, a partition P of {1,...,l}, a one-to-one 
mapping f of P into {1,...,k}, and one-to-one partial map- 
pings $\pi_S$ of V(H) into V(G), S ∈ P, such that for S ∈ P: 

1) $H(\bigcup_{j \in S} V_j \cup V)$ can be $\pi_S$-imbedded in 
$G(W_{f(S)} \cup W)$, 

2) $dom(\pi) \cap (\bigcup_{j \in S} V_j) \subseteq dom(\pi_S)$, and 
$V \subseteq dom(\pi_S)$, 

3) $\pi_S(V) \subseteq W$, 

4) $\pi_S$ is consistent with π and with each of the other map- 
pings $\pi_{S'}$, S' ∈ P. 


3. The NC algorithm 


Here, we consider a new approach to the subgraph isomor- 
phism problem for connected graphs of bounded valence 
and bounded (weighted) separator. A naive method of 
guessing the vertices of the separator in the first graph 
and guessing their image in the second graph can lead to 
a non-polynomial number of considered subgraphs when ap- 
plied recursively [7,9]. However, we will be able to keep 
the maximum number of vertices through which a recur- 
sive subgraph is connected to the rest of the graph bounded 
by a constant using the weighted version of the separa- 
tor, following the general idea of Robertson and Seymour 
for subgraph homeomorphism (see [11,16]). 


Our parallel procedure for π-imbedding uses as a subrou- 
tine the procedure SEP(F,U,b,m) returning 
m-separations of the input graph F whose components 
contain no more than b elements of the input subset U 
of V(F). The procedure SEP is a straightforward gener- 
alization of the procedure 2SEP from [12]. 


procedure SEP(F,U,b,m) 

input: a graph F, a subset U of V(F), and positive integers 
b, m. 

output: the set of all m-separations (W_1, W_2, ..., W_k, W) of 
F, where for i = 1,...,k, |W_i ∩ U| ≤ b. 


for all subsets W of V(F) with at most m vertices of 
F do in parallel 
begin 
    X ← TRUE; 
    find the connected components D_1,...,D_k of the graph 
    resulting from deleting W from F; 
    for i = 1,...,k do in parallel 
        if |V(D_i) ∩ U| > b then X ← FALSE; 
    if X then return (W_1, W_2, ..., W_k, W), where W_i = V(D_i) 
end 
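
The procedure SEP can be sketched sequentially in Python as below. This is a minimal illustration, assuming the graph is an adjacency dictionary with sortable vertex labels; the helper name `components` and the return shape (a list of pairs of component list and separator) are choices made here, not taken from the paper. The parallel "for all" loops are simply run one after another.

```python
from itertools import combinations

def components(adj, removed):
    """Vertex sets of the connected components left after deleting `removed`."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:                      # depth-first search from `start`
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(frozenset(comp))
    return comps

def sep(adj, U, b, m):
    """All pairs (components, W) with |W| <= m such that every component
    of the graph minus W contains at most b vertices of U."""
    U = set(U)
    result = []
    for size in range(m + 1):
        for W in combinations(sorted(adj), size):
            comps = components(adj, W)
            if all(len(C & U) <= b for C in comps):
                result.append((comps, frozenset(W)))
    return result
```

For a path on five vertices with U equal to the whole vertex set, b = 2 and m = 1, the only separation returned removes the middle vertex.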


By a straightforward generalization of Lemma 5.1 in [12], 
we have: 


Lemma 3.1: For a fixed m, the procedure SEP can be 
realized by an $NC^2$ algorithm. 

Sketch: It is sufficient to observe that the number of all 
subsets W of V(F) with at most m elements is $O(n^m)$ 
and that the connected components $D_1,...,D_k$ can be con- 
structed by a concurrent read concurrent write parallel 
RAM with a polynomial number of processors in time 
O(log n) [18], and hence by an $NC^2$ algorithm by [19]. □ 


The main recursive procedure SI for π-imbedding is as 
follows. 


procedure SI(H,T,G,U,m,π) 

Input: graphs H, G, each of valence ≤ d, G with m- 
separator, a subset T of V(H), a subset U of V(G), and a 
one-to-one mapping π of T into U which is 
a sub-isomorphism between H(T) and G(U). 

Output: If there is a π-imbedding of H in G then YES 


c ← max{|U|, m} 

if |T| > |U| or |V(H)| > |V(G)| then go to E; 

if | V(H) |< 2c then 

begin 
    decide whether there is a π-imbedding of H in G 
    by brute force, if so return YES; 
    go to E; 
end 

if | U |> ec then 


begin 
    Pick an m-separation (W_1,...,W_k, W) returned by 
    SEP(G, U, (2/3)|U|) 
    for i = 1,...,k do in parallel 
    begin 
        G_i ← G(W_i ∪ W); 
        U_i ← (U ∩ W_i) ∪ W; 
        for all (V_1,...,V_l, V) returned by 
        SEP(H, T, (2/3)|U|) do in parallel 
            for j = 1,2,...,l do in parallel 
            begin 
                H_j ← H(V_j ∪ V); 
                T_j ← (T ∩ V_j) ∪ V; 
            end 
            for all subsets {T_j1,...,T_jr} of {T_1, T_2, ..., T_l} 
            where |T_j1 ∪ ... ∪ T_jr| ≤ |U_i| do in parallel 
                for all one-to-one mappings π' of T_j1 ∪ ... ∪ T_jr into 
                U_i that are consistent with π and define a sub- 
                isomorphism between H(T_j1 ∪ ... ∪ T_jr) and G(U_i) 
                do in parallel 
                    if SI(H_j1 ∪ ... ∪ H_jr, T_j1 ∪ ... ∪ T_jr, G_i, 
                          π'(T_j1 ∪ ... ∪ T_jr), m, π') 
                    then M((j_1,...,j_r), i) ← 1; 
        Using the table M decide by Lemma 2.1 and brute 
        force whether there exists a π-imbedding of H 
        into G mapping V into W, if so, return YES; 
    end; 
end 
if | U |< 2c then 
begin 
    Pick an m-separation (W_1,...,W_k, W) of G returned 
    by SEP(G, V(G), (2/3)|V(G)|) 
    for i = 1,2,...,k do 
    begin 
        G_i ← G(W_i ∪ W); 
        U_i ← (U ∩ W_i) ∪ W; 
        for all (V_1,...,V_l, V) returned by 
        SEP(H, V(H), (2/3)|V(G)|) do in parallel 
            for j = 1,2,...,l do in parallel 
            begin 
                H_j ← H(V_j ∪ V); 
                T_j ← (T ∩ V_j) ∪ V; 
            end 
            for all subsets {T_j1,...,T_jr} of {T_1, T_2, ..., T_l} 
            where |T_j1 ∪ ... ∪ T_jr| ≤ |U_i| do in parallel 
                for all one-to-one mappings π' of T_j1 ∪ ... ∪ T_jr into 
                U_i that are consistent with π and define a sub- 
                isomorphism between H(T_j1 ∪ ... ∪ T_jr) and G(U_i) 
                do in parallel 
                    if SI(H_j1 ∪ ... ∪ H_jr, T_j1 ∪ ... ∪ T_jr, G_i, 
                          π'(T_j1 ∪ ... ∪ T_jr), m, π') 
                    then M((j_1,...,j_r), i) ← 1; 
        Using the table M decide by Lemma 2.1 and brute 
        force whether there exists a π-imbedding of H 
        into G mapping V into W, if so, return YES; 
    end; 
E: end 


To outline the proof of the correctness of SI we shall use 
the following lemma. 

Lemma 3.2: Procedure SI is correct. 

Sketch: Suppose first that | U |< 4c. Since G has an 
m-separator, SEP(G, V(G), (2/3)|V(G)|) returns at 
least one m-separation. Further, suppose that there is a π- 
imbedding φ of H into G. Then, by Lemma 2.1, for the m- 
separation $(W_1,...,W_k, W)$ of G there is an m-separation 
$(V_1,...,V_l, V)$ of H returned by SEP(H, V(H), |V(H)|), 
and a partition P of {1,...,l} with a one-to-one mapping 
f of P into {1,...,k} such that for S ∈ P, $\bigcup_{j \in S} H_j$ can 
be $\pi_S$-imbedded in $G_{f(S)}$ where $\pi_S$ is a one-to-one map- 
ping of $\bigcup_{j \in S} T_j$ into $U_{f(S)}$ consistent with π, and 
mapping V into W. Note that the graphs $\bigcup_{j \in S} H_j$ and 
$G_{f(S)}$ have m-separators by Remark 2.1. Therefore, they are 
valid parameters for the recursive calls of SI. Conversely, if 
such an m-separation of H and a one-to-one mapping f exist 
then there is a π-imbedding of H in G by Lemma 2.1. 
Suppose in turn that | U |> ee. The purpose of SI in 
this case is to reduce the original problem to imbedding 
subproblems where the subset U of V(G) is split appro- 
priately. Suppose that there is a π-imbedding φ of H into 
G. Since G has an m-separator, SEP(G, U, (2/3)|U|) 
returns at least one m-separation. Then, by Lemma 
2.1, for the m-separation $(W_1,...,W_k, W)$ of G there is an 
m-separation $(V_1,...,V_l, V)$ of H returned by SEP(H, T, 
(2/3)|U|) and a one-to-one mapping f of a partition P 
of {1,...,l} into {1,...,k} such that: 

for S ∈ P, $\bigcup_{j \in S} H_j$ can be $\pi_S$-imbedded in $G_{f(S)}$ where 
$\pi_S$ is a one-to-one mapping of $\bigcup_{j \in S} T_j$ into $U_{f(S)}$ 
consistent with π and the other mappings $\pi_{S'}$, S' ∈ P. 
Conversely, if at least one of the above conditions is satisfied 
then there is a π-imbedding of H in G. □ 


Lemma 3.3: If H, G are connected and the maximum c 
of the parameter m and the cardinality of the parameter 
U is bounded by a constant then the procedure SI can be 
implemented by $NC^3$ circuits. 

Sketch: We may assume without loss of generality that 
|V(H)| ≤ |V(G)| and |V(H)| > 4c. The thesis of the 
lemma follows from the following propositions: 

i) The subsets T and U never have more than 4c elements 
in any recursive call of SI. 

ii) The recursion depth of SI is logarithmic. 

iii) The body of SI can be implemented by $NC^2$ circuits 
if we do not count the recursive calls and the calls of SEP. 

iv) SEP can be realized by an $NC^2$ algorithm. 


Before proving the above propositions, let us show how 
they yield the thesis of the lemma. By the definition of SI, 
the propositions (ii), (iii) and (iv) ensure $O(\log^3 n)$ depth. 
By induction on the depth of a recursive call of SI, we 
easily observe that the subgraphs H', G' of H and G that 
are parameters in the call are separated from the rest 
of the graph by vertices in the subset parameters, say T' 
and U'. Since T', U' are of size O(c) by (i), and H, G are 
of valence ≤ d, the number of connected components of H' 
and G' is O(cd), i.e. O(1). Combining the above fact with 
(i), we conclude that the number of all possible parameters 
in the direct recursive calls of SI(H,T,G,U,m,π) is O(1). 
Hence, by (ii), the number of recursive calls of SI on all 
recursion levels is polynomially bounded. This combined 
with (iii) and (iv) shows that we can compute all these 
calls using a polynomial number of processors, keeping the 
$O(\log^3 n)$ depth. 

The proof of (i) is by induction on the recursive depth of a 
call of SI. At the zero depth, the subsets U and T are of car- 
dinality bounded by c. Assume that (i) holds for the calls 
of SI at the recursive depth q. Let Q be a vertex set that 
is a parameter of one of the above calls of SI. If Q has at 
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most ->c vertices than it can be expanded maximally by 


m vertices (W or V respectively) before further recursive 
calls. If Q@ has more than Pe and at most 3.5c (respec- 
tively, at most 4c) vertices then it is split into parts of at 
most 3c (respectively, ec) vertices that can be augmented 
by at most m vertices before further recursive calls. This 
completes the proof of (i). 

Let us prove (ii). If SJ did not contain the case | U |> 2c 
then it would be of depth proportional to the partition 
tree of G induced by the ”1/3 — 2/3” separator of G, i.e. 
logarithmic in the size of G. On the other hand, it takes 
at most two such calls to reduce the size of the subsets 
Q from 4c to at most Be by the proof of (i). Hence, if 
we follow a path in the tree of recursive calls of SJ then 
we never encounter more than two consecutive calls of SI 
dealing with the case | U |> 42c, which does not increase 


the size of the graph parameters. By the above arguments, 
we can conclude that SJ is of logarithmic depth. 

The proposition (iii) follows from the fact that the number 
of all possible m-separations $(V_1,...,V_l, V)$, $(W_1,...,W_k, 
W)$ is $O(n^m)$, and the number of all possible mappings π' 
considered in the body of SI is bounded by a constant by (i) 
and l = O(1) (see the part of the proof preceding the proof 
of (i)). Note also that the set intersections in the body of 
SI are computed only for finite sets. Thus, the body of SI 
without the invoked procedure calls can be implemented 
by a concurrent read exclusive write parallel RAM with 
a polynomial number of processors in logarithmic time. 
Hence, it can be implemented by $NC^2$ circuits by [19]. 
Finally, the proposition (iv) follows from Lemma 3.1. □ 


Theorem 3.1: Let d and m be positive integer constants. 
Let H and G be connected graphs of valence ≤ d. Assume 
that G has an m-separator. The problem of subgraph iso- 
morphism for such graphs H, G is in $NC^3$. 


OPTIMAL SORTING ON REDUCED ARCHITECTURES 


R. Cypher (*) 


ABSTRACT: This paper studies the problem of sorting N 
items on a P processor parallel machine, where N ≥ P. The 
central result of the paper is a new algorithm, called cubesort, 
that sorts $N = P^{1+1/k}$ items in $O(k P^{1/k} \log P)$ time using a 
P processor shuffle-exchange. Thus for any positive constant 
k, cubesort provides an asymptotically optimal speed-up over 
sequential sorting. Cubesort also sorts N = P log P items 
using a P processor shuffle-exchange in $O(\log^3 P / \log\log P)$ 
time. Both of these results are faster than any previously 
published algorithms for the given problems. Cubesort also 
provides asymptotically optimal sorting algorithms for a wide 
range of parallel computers, including the cube-connected cy- 
cles and the hypercube. An important extension of the central 
result is an algorithm that simulates a single step of a Priority- 
CRCW PRAM with N processors and N words of memory 
on a P processor shuffle-exchange machine in $O(k P^{1/k} \log P)$ 
time, where $N = P^{1+1/k}$. 


I. Introduction 


This paper presents a new parallel algorithm for sorting N 
items using P processors, where N ≥ P. This new algorithm 
can be implemented efficiently on a wide range of parallel 
computers, including the hypercube, the shuffle-exchange and 
the cube-connected cycles. In particular, the algorithm runs 
in O((N log N)/P) time on any of the above architectures, 
provided $N = P^{1+1/k}$ for some positive constant k. This is 
the first sorting algorithm for any of the above architectures 
that obtains this performance. In addition, the sorting algo- 
rithm will be extended to obtain an efficient simulation of a 
Priority-CRCW PRAM using a hypercube, shuffle-exchange 
or cube-connected cycles. The remainder of this section reviews 
models of parallel computers and examines previous work in 
the field of parallel sorting. 

The models of parallel computers that will be used in this 
paper are the PRAM [5], the hypercube [10], the shuffle- 
exchange [10] and the cube-connected cycles [11]. These models 
operate in an SIMD mode, with all of the processors performing 
the same instruction at any given time. The PRAM is a shared 
memory model in which all processors can access a common 
memory in unit time. The Priority-CRCW PRAM allows 
multiple processors to read from or write to a single memory 
location simultaneously. In the case of simultaneous writes to 
a single location, the lowest numbered processor attempting 
to write to that location succeeds. 

The hypercube, the shuffle-exchange and the cube-connected 
cycles consist of a set of processors, each containing a local 
memory, that communicate with one another using a fixed 
interconnection network. In the hypercube, the P processors 
are numbered 0 .. P-1 and processors i and j are connected if 
the binary representations of i and j differ in exactly 1 bit 
position. In the shuffle-exchange, the P processors are num- 
bered 0 .. P-1 and processors i and j are connected if j = 
Shuffle(i, P), j = Unshuffle(i, P) or j = Exchange(i), where 
Shuffle(i, P) = 2i mod (P-1), Unshuffle(i, P) = j iff Shuffle(j, 
P) = i, and Exchange(i) = i+1 - 2(i mod 2). The cube- 
connected cycles contains P processors, where $P = 2^K$ and 
$K = R + 2^R$. The processors are numbered with pairs (b, c) 
where b is a (K-R) bit number and c is an R bit number. 
Processor (b, c) is connected to processor (d, e) if b = d and 
c = e+1, if b = d and c = e-1, or if c = e and the binary 
representations of b and d differ in only the c-th bit position. 
The shuffle-exchange and the cube-connected cycles are feasible 
models because each processor is connected to only a fixed 
number of other processors. 
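
The shuffle-exchange connections described above can be sketched in Python as follows. This is an illustrative sketch only: it assumes P is a power of two, and it treats processor P-1 as a fixed point of the shuffle (the formula 2i mod (P-1) alone would send it to 0); that convention and the function names are assumptions made here, not statements from the paper.

```python
def shuffle(i, P):
    """2i mod (P-1), i.e. a left rotation of the binary representation of i.
    Processor P-1 is kept as a fixed point (assumed convention)."""
    return i if i == P - 1 else (2 * i) % (P - 1)

def unshuffle(i, P):
    """Inverse of shuffle: the unique j with shuffle(j, P) == i."""
    return next(j for j in range(P) if shuffle(j, P) == i)

def exchange(i):
    """Flip the least significant bit: i + 1 - 2(i mod 2)."""
    return i + 1 - 2 * (i % 2)

def neighbours(i, P):
    """Processors directly connected to processor i (at most three)."""
    return {shuffle(i, P), unshuffle(i, P), exchange(i)} - {i}
```

Each processor has at most three neighbours regardless of P, which is the fixed-degree property the text calls feasible.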

One of the earliest results in parallel sorting was obtained 
by Batcher. In [3], Batcher presented the bitonic sorting algo- 
rithm. In [12], Stone showed that the bitonic sort could be 
implemented on a shuffle-exchange. This yields an $O(\log^2 N)$ 
time sort for N = P numbers on the shuffle-exchange. 

In [4], Baudet and Stevenson show how any parallel algo- 
rithm for sorting N items with P = N processors that is based 
on comparisons and exchanges can be used to obtain an al- 
gorithm for sorting N items with P < N processors. By 
applying their technique to the bitonic sort on the shuffle- 
exchange, they obtained an $O((N/P) \log (N/P) + (N/P) \log^2 P)$ 
time sorting algorithm when P < N. Their algorithm pro- 
vides an optimal speed-up over sequential comparison sorting 
only when $P = O(2^{\sqrt{\log N}})$. 
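
The restriction on P can be checked with a short calculation. The following is a sketch of the reasoning, assuming that "optimal speed-up" means a running time of O((N log N)/P) and that logarithms are base 2; the constant c is introduced here only for the argument.

```latex
% Running time of the Baudet-Stevenson adaptation of bitonic sort:
T(N,P) = O\!\left(\frac{N}{P}\log\frac{N}{P} + \frac{N}{P}\log^{2} P\right).
% The first term never exceeds (N/P) log N, so the speed-up is optimal iff
\frac{N}{P}\log^{2} P = O\!\left(\frac{N}{P}\log N\right)
\iff \log^{2} P = O(\log N).
% If P <= c * 2^{sqrt(log N)} for a constant c, then
\log P \le \sqrt{\log N} + \log c
\quad\Longrightarrow\quad \log^{2} P = O(\log N).
```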

An algorithm for a special case of the sorting problem was 
given by Gottlieb and Kruskal [6]. They presented a shuffle- 
exchange algorithm for the permutation problem, where the 
N numbers to be sorted are in the range 1 through N and 
where each number appears exactly once. Their algorithm 
requires op? + (N/P) log P) time and gives optimal speed-up 
over sequential comparison sorting when P = O((N log N)?/?), 
In their paper, Gottlieb and Kruskal state that they do not 
know of an optimal algorithm for the permutation problem 
when FP is not in O((N log N)?/ 2) The current paper thus 
improves upon Gottlieb and Kruskal's result in two ways. 
First, the algorithm presented in this paper solves the general 
sorting problem rather than the permutation problem. Second, 
the algorithm presented here gives optimal speed-up when 
$N = P^{1+1/k}$ for any positive constant k. 

A breakthrough in parallel sorting was obtained by Ajtai, 
Komlos and Szemeredi [2]. They created a network for sorting 
N items that consists of O(N log N) comparators and has 
O(log N) depth. This network was used by Leighton to create 
a feasible parallel machine that sorts in O(log N) time when 
P = N [8]. 


(**) Computer Science Department IBM Almaden Research Ctr, San Jose, CA 95120 


Unfortunately, there are two serious difficulties with 
Leighton's technique. First, the technique performs poorly 
for P < 10! In contrast, the algorithm presented in this 
paper has a much smaller constant of proportionality and is 
much more likely to be useful in practice. Second, Leighton’s 
network is not a standard network that has been shown to be 
useful for solving problems other than sorting. In contrast, 
the shuffle-exchange and the cube-connected cycles have been 
proven useful in solving a wide range of problems. 

Another important related result was obtained by Leighton. 
Leighton has recently shown that his algorithm called 
columnsort [8] can be used to obtain an efficient algorithm for 
sorting $N = P^{1+1/k}$ items on a P processor shuffle-exchange 
[9]. He obtains an $O(k^T P^{1/k} \log P)$ time algorithm, where 
$T = 2/\log_2 1.5$ (T is approximately 3.419). The algorithm is 
based on calling columnsort in a nested manner so that the 
N items are sorted by repeatedly sorting groups of P! items 
each. Furthermore, there is a possibility that the value of the 
exponent T can be reduced to less than 1 by using Leighton's 
concept of closesorting [8],[9]. Finally, a similar result using 
columnsort was obtained by Aggarwal [1]. More research into 
the applications of columnsort is clearly needed. 

The paper is divided as follows. Section 2 presents an 
abstract description of the new sorting algorithm and proves 
its correctness. Section 3 shows how this sorting algorithm 
can be implemented efficiently on a number of parallel com- 
puters and it presents an algorithm for simulating a Priority- 
CRCW PRAM with a shuffle-exchange computer. Throughout 
this paper, N will be the number of items to be sorted and P 
will be the number of processors available. 


2. Cubesort 


This section contains a description of a new parallel sorting 
algorithm that the authors call cubesort. The description of 
cubesort given in this section is independent of the architecture 
that is used to implement it. Cubesort works by repeatedly 
partitioning the N items to be sorted into small groups and 
sorting these groups separately and in parallel. In particular, 
let $N = M^D$ where M and D are integers. Each step of 
cubesort partitions the $M^D$ items into either $M^{D-1}$ groups of 
M items each or $M^{D-2}$ groups of $M^2$ items each, and sorts 
the groups in parallel. 

The $M^D$ items to be sorted can be viewed as occupying a 
D-dimensional cube, where each side of the cube is of length 
M. Each location L in the cube has an address of the form 
$L = (L_D, L_{D-1}, ..., L_1)$, where $(L_D, L_{D-1}, ..., L_1)$ is a D-digit 
base-M number and $L_i$ is the projection of location L along 
the i-th dimension. This numbering of the locations in the 
cube corresponds to an ordering of the locations that will be 
called row-major order. Cubesort will sort the items in the 
cube into row-major order. 

In addition to viewing the items as forming a single D- 
dimensional cube, they can be viewed as forming a number of 
cubes of smaller dimension. A j-cube, where 0 ≤ j ≤ D, is a 
set of $M^j$ items with base-M addresses that differ only in the 
j least significant digits. That is, $A = (A_D, A_{D-1}, ..., A_1)$ and 
$B = (B_D, B_{D-1}, ..., B_1)$ are in the same j-cube if and only if 
$A_i = B_i$ for all i, j+1 ≤ i ≤ D. Each j-cube, where 0 ≤ j ≤ 
D, is classified as being either even or odd. A j-cube, where 0 
≤ j ≤ D-1, is even if it contains a location L where $L_{j+1}$ 
mod 2 = 0, and it is odd otherwise. The D-cube that contains 
all N items is defined to be even. 

There are D different partitions, represented as $P_j$ where 1 
≤ j ≤ D, that are used by cubesort. A group in partition $P_j$ 
consists of a set of items with base-M addresses that differ 
only in digits j and j-1. Note that each group in $P_1$ contains 
M items, while each group in the remaining partitions contains 
$M^2$ items. 
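
The addressing scheme, the j-cubes with their parity, and the partitions can be illustrated with a few Python helpers. This is a sketch under assumptions made here: locations are identified with row-major indices 0 .. M^D - 1, digits take values 0 .. M-1, and all function names are hypothetical.

```python
def address(index, M, D):
    """Base-M digits (L_D, ..., L_1) of a row-major index in 0 .. M**D - 1."""
    digits = []
    for _ in range(D):
        digits.append(index % M)
        index //= M
    return tuple(reversed(digits))        # most significant digit L_D first

def digit(addr, i):
    """L_i, numbered from 1 at the least significant digit."""
    return addr[len(addr) - i]

def j_cube(addr, j):
    """Identifier of the j-cube containing addr: the digits L_D .. L_{j+1}."""
    return addr[:len(addr) - j]

def is_even(addr, j):
    """Parity of the j-cube containing addr; the D-cube is even by definition."""
    if j == len(addr):
        return True
    return digit(addr, j + 1) % 2 == 0

def group_key(addr, j):
    """Identifier of the group of partition P_j containing addr:
    every digit except L_j and L_{j-1} (only L_1 is dropped when j = 1)."""
    D = len(addr)
    return tuple(d for i, d in zip(range(D, 0, -1), addr) if i not in (j, j - 1))
```

For M = 3 and D = 3, grouping the 27 addresses by `group_key(addr, 1)` gives 9 groups of 3 items, and grouping by `group_key(addr, 2)` gives 3 groups of 9 items, matching the group sizes stated in the text.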

Finally, it is sometimes useful to view the items in a j-cube 
as forming a 2-dimensional array. A j-array, where 2 ≤ j ≤ 
D, is an M^2 × M^{j-2} array of the items in a j-cube, where the 
items are placed in the array in row-major order. Thus each 
(j-2)-cube forms a row in a j-array, and each (j-1)-cube forms 
a band of M consecutive rows in a j-array. Also, each column 
in a j-array is a group in P_j. 
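
The relationship between j-arrays and partition groups can be checked with a short Python sketch. The helper names `partition_group` and `j_array` are hypothetical, and addresses are again row-major indices.

```python
def partition_group(x, j, M):
    """Addresses, in row-major order, of the group of P_j that contains address x."""
    lo = M ** max(j - 2, 0)              # weight of digit j-1 (digit 1 when j = 1)
    size = M ** j // lo                  # M for P_1, M^2 for every other partition
    base = x - ((x // lo) % size) * lo   # zero out the digits that vary in the group
    return [base + t * lo for t in range(size)]


def j_array(x, j, M):
    """The j-array (M^2 rows, M^(j-2) columns) of the j-cube containing x, for j >= 2."""
    start = (x // M ** j) * M ** j       # smallest address in the j-cube
    width = M ** (j - 2)
    return [[start + r * width + c for c in range(width)] for r in range(M * M)]
```

For example, with M = 2 the 3-array of the 3-cube containing address 13 has rows [8, 9], [10, 11], [12, 13], [14, 15], and its second column [9, 11, 13, 15] is exactly the group of P_3 that contains address 13.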

In order for cubesort to work correctly, it is assumed that 
D ≥ 3 and that M ≥ (D-1)(D-2). Cubesort makes use of 
two subroutines, Sort_Ascending and Sort_Mixed. The sub- 
routine Sort_Ascending(i) sorts the groups in partition P_i in 
ascending row-major order. The subroutine Sort_Mixed(i, j) 
sorts the groups in partition P_j that are in even i-cubes in 
ascending order, while it sorts the groups in P_j that are in odd 
i-cubes in descending order. Cubesort is called by first setting 
the global variables M and D and then calling Cubesort(D). 
The pseudo-code of cubesort is given below. 


Cubesort(S)     /* Abstract Description of Cubesort */ 
integer S; 
{ 
    if S = 3 then 
    { 
        /* PHASE 1: */ 
        Limit_Dirty_Cubes(S); 
        /* PHASE 2: */ 
        Sort_Mixed(S-1, S-1); 
        /* PHASE 3: */ 
        Merge_Dirty_Cubes(S, S); 
    } 
    else 
    { 
        /* PHASE 1: */ 
        Limit_Dirty_Cubes(S); 
        Limit_Dirty_Cubes(S); 
        /* PHASE 2: */ 
        Cubesort(S-1); 
        /* PHASE 3: */ 
        Merge_Dirty_Cubes(S, S); 
    } 
} 


Limit_Dirty_Cubes(S) 
integer S; 
{ 
    if S > 2 then 
        Limit_Dirty_Cubes(S-1); 
    Sort_Ascending(S); 
} 


Merge_Dirty_Cubes(S, T) 
integer S, T; 
{ 
    Sort_Mixed(S, T); 
    if T > 2 then 
        Merge_Dirty_Cubes(S, T-2); 
} 


The call Cubesort(S) sorts each even S-cube in ascending 
row-major order and each odd S-cube in descending row-major 
order. In order to prove that cubesort works correctly, it is 
necessary to use the zero-one principle [7], which states that 
“if a network with n input lines sorts all 2^n sequences of 0's 
and 1’s into nondecreasing order, it will sort any arbitrary 
sequence of n numbers into nondecreasing order”. In keeping 
with the zero-one principle, the following discussion will assume 
that the input consists entirely of 0’s and 1’s. 
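
The pseudo-code can be exercised with a sequential Python simulation. This is a minimal sketch, not a parallel implementation: each "sort the groups in parallel" stage is replaced by a loop over the groups. It assumes that Merge_Dirty_Cubes(S, T) begins with Sort_Mixed(S, T) and that Phase 2 of the S = 3 case is Sort_Mixed(S-1, S-1); M and D are passed as parameters instead of being global, and all Python names are illustrative.

```python
import random


def groups(j, M, D):
    """Yield the address lists of the groups in partition P_j, in row-major order."""
    lo = M ** max(j - 2, 0)          # weight of the lowest varying digit
    size = M ** j // lo              # M for P_1, M^2 otherwise
    for base in range(M ** D):
        if (base // lo) % size == 0:  # varying digits all zero: a group's first address
            yield [base + t * lo for t in range(size)]


def is_even_cube(x, i, M, D):
    """Parity of the i-cube containing address x; the D-cube is even by definition."""
    return i == D or (x // M ** i) % M % 2 == 0


def sort_groups(a, j, M, D, ascending_for):
    for g in groups(j, M, D):
        vals = sorted((a[x] for x in g), reverse=not ascending_for(g[0]))
        for x, v in zip(g, vals):
            a[x] = v


def sort_ascending(a, i, M, D):
    sort_groups(a, i, M, D, lambda x: True)


def sort_mixed(a, i, j, M, D):
    # Groups of P_j in even i-cubes ascending, in odd i-cubes descending.
    sort_groups(a, j, M, D, lambda x: is_even_cube(x, i, M, D))


def limit_dirty_cubes(a, S, M, D):
    if S > 2:
        limit_dirty_cubes(a, S - 1, M, D)
    sort_ascending(a, S, M, D)


def merge_dirty_cubes(a, S, T, M, D):
    sort_mixed(a, S, T, M, D)        # assumed second argument: T
    if T > 2:
        merge_dirty_cubes(a, S, T - 2, M, D)


def cubesort(a, S, M, D):
    limit_dirty_cubes(a, S, M, D)             # Phase 1
    if S == 3:
        sort_mixed(a, S - 1, S - 1, M, D)     # Phase 2
    else:
        limit_dirty_cubes(a, S, M, D)         # Phase 1, second call
        cubesort(a, S - 1, M, D)              # Phase 2
    merge_dirty_cubes(a, S, S, M, D)          # Phase 3


M, D = 2, 3
data = [random.randrange(100) for _ in range(M ** D)]
out = data[:]
cubesort(out, D, M, D)   # out should now equal sorted(data)
```

With M = 2 and D = 3 the sketch can be checked exhaustively against all 2^8 inputs of 0's and 1's, which by the zero-one principle covers every input of 8 items for this choice of M and D.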

The following definitions will be needed in the proof of 
correctness. A set of items is dirty if it contains both 0’s and 
1’s, and it is clean otherwise. A sequence of 0’s and 1’s is 
ascending if it is of the form 0^a 1^b, where a,b ≥ 0, and it is 
descending if it is of the form 1^a 0^b, where a,b ≥ 0. A sequence 
is monotonic if it is either ascending or descending. A sequence 
of 0’s and 1’s is bitonic if it is of the form 0^a 1^b 0^c or of the 
form 1^a 0^b 1^c, where a,b,c ≥ 0. A j-array is cross-sorted if all 
of its rows are monotonic and if it has at most 1 ascending 
dirty row and at most 1 descending dirty row. A j-array is 
semi-sorted if all of its rows are bitonic and if it has at most 
1 dirty row. A j-array is block-sorted in ascending (descending) 
order if it consists of A rows containing only 0’s (1’s), followed 
by B dirty rows, followed by C rows containing only 1’s (0’s), 
where A,B,C ≥ 0. A j-cube is cross-sorted (or semi-sorted or 
block-sorted) if its corresponding j-array is cross-sorted (or 
semi-sorted or block-sorted). 
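
These definitions translate directly into predicates on lists of 0's and 1's. The following Python sketch is one possible encoding; a j-array is represented as a list of non-empty rows, and the function names are illustrative.

```python
def is_dirty(seq):
    """A set of 0/1 items is dirty if it contains both values."""
    return 0 in seq and 1 in seq


def runs(seq):
    """Collapse a 0/1 sequence to the list of its distinct consecutive values."""
    out = []
    for v in seq:
        if not out or out[-1] != v:
            out.append(v)
    return out


def is_ascending(seq):
    return runs(seq) in ([], [0], [1], [0, 1])


def is_descending(seq):
    return runs(seq) in ([], [0], [1], [1, 0])


def is_bitonic(seq):
    # 0^a 1^b 0^c or 1^a 0^b 1^c with a, b, c >= 0 means at most three runs.
    return len(runs(seq)) <= 3


def is_cross_sorted(rows):
    asc_dirty = sum(1 for r in rows if is_dirty(r) and is_ascending(r))
    desc_dirty = sum(1 for r in rows if is_dirty(r) and is_descending(r))
    monotonic = all(is_ascending(r) or is_descending(r) for r in rows)
    return monotonic and asc_dirty <= 1 and desc_dirty <= 1


def is_semi_sorted(rows):
    return all(is_bitonic(r) for r in rows) and sum(map(is_dirty, rows)) <= 1


def is_block_sorted(rows, ascending=True):
    first = 0 if ascending else 1
    # Label rows: 0 = clean leading value, 1 = dirty, 2 = clean trailing value.
    labels = [1 if is_dirty(r) else (0 if r[0] == first else 2) for r in rows]
    return labels == sorted(labels)
```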

The correctness of Cubesort(S) is established next. In order 
to save space, the proofs have been omitted. 

LEMMA 1: If a j-array originally has B dirty rows, and if 
the columns of the j-array are then sorted in ascending (de- 
scending) order, the resulting j-array will be block-sorted in 
ascending (descending) order and will contain no more than 
B dirty rows. 

LEMMA 2: After calling Limit_Dirty_Cubes(i), where i ≥ 
2, there are at most i-1 dirty (i-1)-cubes in each i-cube, and 
the dirty (i-1)-cubes are consecutive within each i-array. 

LEMMA 3: If a j-array is originally semi-sorted or cross- 
sorted, and if the columns of the j-array are then sorted in 
ascending (descending) order, the resulting j-array will be semi- 
sorted and it will be block-sorted in ascending (descending) 
order. 

THEOREM 1: After calling Cubesort(3), each even 3-cube 
is sorted in ascending row-major order and each odd 3-cube 
is sorted in descending row-major order. 

LEMMA 4: When S > 3, after Phase 1 there are at most 
2 dirty (S-1)-cubes in each S-cube, and these dirty cubes are 
adjacent to one another in the S-array. 

LEMMA 5: For any values of S and T, where 1 < T ≤ S 
≤ D, if originally each T-cube is either semi-sorted or cross- 
sorted, and if Merge_Dirty_Cubes(S, T) is then called, the 
resulting T-cubes will all be sorted. Furthermore, the T-cubes 
that are in even S-cubes will be sorted in ascending order and 
the T-cubes that are in odd S-cubes will be sorted in descending 
order. 

THEOREM 2: Cubesort(S), where 3 ≤ S ≤ D, sorts each 
even S-cube in ascending order and each odd S-cube in de- 
scending order. 


3. Implementing Cubesort 


The cubesort algorithm given in the previous section sorts N 
= M^D numbers by performing O(D^2) stages, where each stage 
consists of sorting, in parallel, groups containing O(M^2) items. 
This section will show how cubesort can be implemented on 
a variety of parallel models. 

First, the implementation of cubesort on a shuffle-exchange 
will be presented. It will be assumed that there are N items 
to be sorted and that P processors are available, where N = 
P^{1+1/k}. The items to be sorted are stored in an N-item array 
A, where A_i is located in processor j = floor(i/P^{1/k}), for 0 ≤ 
i ≤ N-1. In order to use the algorithm from the previous 
section, let D = 2k+2, let M = P^{1/(2k)}, and let location L = 
(L_D, L_{D-1}, ..., L_1) in the D-dimensional cube correspond to A_L. 
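
The parameter choice can be illustrated with a short Python sketch. It assumes that P is a perfect (2k)-th power so that M is an integer, and it also checks the requirement M ≥ (D-1)(D-2) from the previous section; the function names are hypothetical.

```python
def cubesort_parameters(P, k):
    """Return (N, M, D) for P processors and N = P^(1 + 1/k) items."""
    M = round(P ** (1.0 / (2 * k)))
    if M ** (2 * k) != P:
        raise ValueError("P must be a perfect (2k)-th power")
    D = 2 * k + 2
    if M < (D - 1) * (D - 2):
        raise ValueError("cubesort assumes M >= (D-1)(D-2)")
    return M ** D, M, D


def processor_of(i, P, k):
    """Processor holding A_i when each processor stores P^(1/k) = M^2 consecutive items."""
    _, M, _ = cubesort_parameters(P, k)
    return i // (M * M)
```

For example, P = 64 and k = 1 give M = 8, D = 4 and N = 4096, so each processor holds M^2 = 64 items, which is exactly the size of the largest groups that cubesort sorts.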

Before the groups of a partition are sorted, the data are 
rearranged so that each group lies within a single processor. 
There are 2 permutations that are used to perform this rear- 
rangement, namely the M-Shuffle and the M-Unshuffle. The 
definition of the M-Shuffle of N items is that M-Shuffle(X, 
N) = MX mod (N-1). The M-Unshuffle is the inverse of the 
M-Shuffle, so M-Unshuffle(Z, N) = X iff M-Shuffle(X, N) = 
Z. 
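
The two permutations can be sketched in Python as index maps. Two assumptions are made here that the text does not state: index N-1 is treated as a fixed point (otherwise it would collide with index 0), and N is a power of M, so that N/M is the multiplicative inverse of M modulo N-1.

```python
def m_shuffle(x, N, M):
    """M-Shuffle(X, N) = M*X mod (N-1); index N-1 is kept fixed (assumption)."""
    return x if x == N - 1 else (M * x) % (N - 1)


def m_unshuffle(z, N, M):
    """Inverse of m_shuffle, assuming N is a power of M."""
    # M * (N // M) = N, which is congruent to 1 modulo N-1.
    return z if z == N - 1 else ((N // M) * z) % (N - 1)
```

When N = M^D, the M-Shuffle rotates the D base-M digits of an index cyclically to the left by one position. With M = 2 and N = 8, for instance, index 5 (binary 101) is mapped to index 3 (binary 011).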

The P^{1/k} items that are local to each processor can be sorted 
in O((1/k) P^{1/k} log P) time. Also, the M-Shuffle and M- 
Unshuffle of the items to be sorted can each be accomplished 
in O((1/k) P^{1/k} log P) time. Because the shuffle-exchange im- 
plementation of cubesort consists of O(D^2) = O(k^2) applica- 
tions of sorts that are local to processors and O(k^2) applications 
of the M-Shuffle and M-Unshuffle routines, the entire algorithm 
requires O(k P^{1/k} log P) time. 
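
The arithmetic behind this bound can be written out as follows, taking the per-stage costs stated above as given:

```latex
\begin{align*}
T_{\text{stage}} &= O\!\left(P^{1/k}\log P^{1/k}\right)
                  = O\!\left(\tfrac{1}{k}\,P^{1/k}\log P\right),\\
T_{\text{total}} &= O(k^{2})\cdot O\!\left(\tfrac{1}{k}\,P^{1/k}\log P\right)
                  = O\!\left(k\,P^{1/k}\log P\right).
\end{align*}
```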

The implementations of cubesort on the hypercube and the 
cube-connected cycles are similar to the implementation on 
the shuffle-exchange. Because of space limitations, only the 
result will be stated. Cubesort can be implemented in O(k P^{1/k} 
log P) time on a hypercube or cube-connected cycles. 

In the above discussion, it was assumed that N = P^{1+1/k}. 
However, cubesort can also be used when the number of items 
per processor grows more slowly. In particular, when N = P 
log P cubesort yields an O(log^3 P / log log P) time sorting algo- 
rithm for a P processor shuffle-exchange. Again, space limi- 
tations prevent including that algorithm. 
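
As a consistency check on the form of this bound only, one can substitute formally into the earlier O(k P^{1/k} log P) bound with k chosen so that each processor holds log P items. This is an assumption made here for illustration: the omitted algorithm is not described in the text, and the substitution does not satisfy the requirement M ≥ (D-1)(D-2), so it should not be read as a derivation of that algorithm.

```latex
\begin{align*}
P^{1/k} = \log P \;&\Longrightarrow\; k = \frac{\log P}{\log\log P},\\
O\!\left(k\,P^{1/k}\log P\right)
  &= O\!\left(\frac{\log P}{\log\log P}\cdot\log P\cdot\log P\right)
   = O\!\left(\frac{\log^{3} P}{\log\log P}\right).
\end{align*}
```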

Finally, cubesort can be used to simulate a Priority-CRCW 
PRAM with a shuffle-exchange computer. Because of space 
limitations, only the result will be stated. A single operation 
of a Priority-CRCW PRAM with N processors and N memory 
locations can be implemented in O(k P^{1/k} log P) time on a P 
processor shuffle-exchange, where N = P^{1+1/k}. 